HRv4 hangs on orion and hercules #2486

Open
RuiyuSun opened this issue Oct 30, 2024 · 97 comments
Labels
bug (Something isn't working)

Comments

@RuiyuSun
Contributor

RuiyuSun commented Oct 30, 2024

George V. noticed that HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts, with no relevant messages about the hang in the log files.

To Reproduce: Run an HRv4 experiment on Hercules or Orion

Additional context

Output

@RuiyuSun added the bug label Oct 30, 2024
@GeorgeVandenberghe-NOAA
Collaborator

This happens at high ATM resolution C1152.

@RuiyuSun
Contributor Author

RuiyuSun commented Nov 4, 2024

I made an HRv4 test run on Orion as well. As reported previously, it hung at the beginning of the run.

The log file is at /work2/noaa/stmp/rsun/ROTDIRS/HRv4

HOMEgfs=/work/noaa/global/rsun/git/global-workflow.hr.v4 (source)
EXPDIR=/work/noaa/global/rsun/para_gfs/HRv4
COMROOT=/work2/noaa/stmp/rsun/ROTDIRS
RUNDIRS=/work2/noaa/stmp/rsun/RUNDIRS

@LarissaReames-NOAA
Collaborator

@RuiyuSun Denise reports that the privacy settings on your directories are preventing her from accessing them. Could you check on that and report back when it's fixed so others can look at your forecast?
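For reference, a minimal sketch of one way to open those directories up (assuming group/world read access is acceptable; adjust the paths and permissions as appropriate):

    # Hypothetical example: grant group and others read + directory-traverse access
    chmod -R g+rX,o+rX /work2/noaa/stmp/rsun/ROTDIRS/HRv4
    chmod -R g+rX,o+rX /work/noaa/global/rsun/para_gfs/HRv4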

@RuiyuSun
Contributor Author

RuiyuSun commented Nov 5, 2024

@DeniseWorthen I made the changes. Please try again.

@JessicaMeixner-NOAA
Collaborator

I've made a few test runs on my end and here are some observations:

  • This also fails at C768 S2SW
  • This fails at C1152 S2S (so I do not think this is wave-grid related).

Consistently, all runs I have made (the same as @RuiyuSun's runs) stall out here:

    0:  fcst_initialize total time:    200.367168849800
    0:  fv3_cap: field bundles in fcstComp export state, FBCount=            8
    0:  af allco wrtComp,write_groups=           4
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to    32768.
 9216:  &MPP_IO_NML
 9216:  HEADER_BUFFER_VAL       =       16384,
 9216:  GLOBAL_FIELD_ON_ROOT_PE = T,
 9216:  IO_CLOCKS_ON    = F,
 9216:  SHUFFLE =           0,
 9216:  DEFLATE_LEVEL   =          -1,
 9216:  CF_COMPLIANCE   = F
 9216:  /
 9216: NOTE from PE     0: MPP_IO_SET_STACK_SIZE: stack size set to     131072.
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 16000000.
 9216:  num_files=           2
 9216:  num_file=           1 filename_base= atm output_file= netcdf_parallel
 9216:  num_file=           2 filename_base= sfc output_file= netcdf_parallel
 9216:  grid_id=            1  output_grid= gaussian_grid
 9216:  imo=        4608 jmo=        2304
 9216:  ideflate=           1
 9216:  quantize_mode=quantize_bitround quantize_nsd=           5
 9216:  zstandard_level=           0
    0:  af wrtState reconcile, FBcount=           8
    0:  af get wrtfb=output_atm_bilinear rc=           0

With high-resolution runs (C768 & C1152) we've had to use different numbers of write grid tasks on the various machines. I've tried a few and all are stalling, though. This is using ESMF managed threading, so one thing to try might be moving away from that?

To run a high res test case:

git clone --recursive https://github.com/NOAA-EMC/global-workflow
cd global-workflow/sorc
./build_all.sh
./link_workflow.sh
cd ../../
mkdir testdir 
cd testdir 
source ../global-workflow/workflow/gw_setup.sh 
HPC_ACCOUNT=marine-cpu pslot=C1152t02 RUNTESTS=`pwd` ../global-workflow/workflow/create_experiment.py --yaml ../global-workflow/ci/cases/hires/C1152_S2SW.yaml

Change C1152 to C768 to run that resolution, and also change your HPC_ACCOUNT and pslot as desired. Lastly, if you want to turn off waves, change that in C1152_S2SW.yaml. If you want to change resources, look in global-workflow/parm/config/gfs/config.ufs in the C768/C1152 section.

If you want to run S2S only, change the app in global-workflow/ci/cases/hires/C1152_S2SW.yaml
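A minimal sketch of that edit (the sed pattern and what the yaml actually contains are assumptions, so double-check the file before submitting):

    # Hypothetical example: copy the C1152 case and switch the app from S2SW to S2S
    cp global-workflow/ci/cases/hires/C1152_S2SW.yaml my_C1152_S2S.yaml
    sed -i 's/S2SW/S2S/g' my_C1152_S2S.yaml   # assumes the app is spelled S2SW inside the yaml
    # Resources (e.g. write-group settings) live in global-workflow/parm/config/gfs/config.ufs
    # under the C768/C1152 section, per the instructions above.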

My latest run log files can be found at:
/work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t0*/COMROOT/C1152t0*/logs/2019120300/gfs_fcst_seg0.log
(several runs are in progress, but they've all been running for over an hour and all hung at the same spot, despite changing write grid tasks).

@JessicaMeixner-NOAA
Collaborator

@GeorgeVandenberghe-NOAA suggested trying 2 write groups with 240 tasks each. I meant to try that but unintentionally ran 2 write groups with 360 tasks per group; I did turn on all PET files, as @LarissaReames-NOAA thought they might have helpful info.

The run directory is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800

The log file is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t06/COMROOT/C1152t06/logs/2019120300/gfs_fcst_seg0.log

The PET logs to me also point to write group issues. Any help with this would be greatly appreciated.

Tagging @aerorahul for awareness.

@JacobCarley-NOAA

Thanks to everyone for the work on this. Has anyone tried this configuration with the write component off? That might help isolate where the problem is (hopefully), and then we can direct this accordingly for further debugging.

@JessicaMeixner-NOAA
Collaborator

I have not tried this without the write component.

@DusanJovic-NOAA
Collaborator

@JessicaMeixner-NOAA and others, I grabbed the run directory from the last experiment you ran (/work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800), changed it to run just the ATM component, and converted it to run with traditional threading. It is currently running in /work2/noaa/stmp/djovic/stmp/fcst.272800, and it passed the initialization phase and finished writing the 000 and 003 hour outputs successfully. I submitted the job with just a 30 min wall-clock time limit, so it will fail soon.

I suggest you try running the full coupled version with traditional threading if it's easy to reconfigure.

@jiandewang
Collaborator

Some good news: I tried the HR4 tag; the only thing I changed is WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 20 to 10, and the model is running. Note my run is S2S. See the log file at
/work/noaa/marine/Jiande.Wang/HERCULES/HR4/work/HR4-20191203/COMROOT/2019120300/HR4-20191203/logs/2019120300/gfsfcst_seg0.log
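For anyone repeating this on their own experiment, a sketch of the change (the variable name is from this thread; exactly where it is set, e.g. config.ufs versus the experiment's config files, is an assumption):

    # Reduce write tasks per group per thread per tile from 20 to 10 for the GFS C1152 case
    export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10   # was 20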

@jiandewang
Collaborator

my 48hr run finished

@JessicaMeixner-NOAA
Collaborator

@DusanJovic-NOAA I tried running without ESMF threading, but am struggling to get it set up correctly and go through. @aerorahul is it expected that turning off ESMF managed threading in the workflow should work?

I'm also trying on Hercules to replicate @jiandewang's success, but with S2SW.

@jiandewang
Collaborator

I also launched one S2SW run, but it's still in pending status.

@JessicaMeixner-NOAA
Collaborator

WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 with S2S did not work on Orion: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t03/COMROOT/C1152t03/logs/2019120300/gfs_fcst_seg0.log

@jiandewang
Collaborator

Mine is on Hercules.

@jiandewang
Collaborator

@JessicaMeixner-NOAA my gut feeling is that the issue is related to memory per node; Hercules has more than Orion. Maybe you can try 5 on Orion.

@aerorahul
Contributor

@DusanJovic-NOAA I tried running without ESMF threading, but am struggling to get it set up correctly and go through. @aerorahul is it expected that turning off ESMF managed threading in the workflow should work?

I'm also trying on Hercules to replicate @jiandewang's success, but with S2SW.

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

@DusanJovic-NOAA
To run w/ traditional threading, what else did you update in the test case borrowed from @JessicaMeixner-NOAA?

@DusanJovic-NOAA
Collaborator

DusanJovic-NOAA commented Nov 8, 2024

I only changed ufs.configure:

  1. remove all components except ATM
  2. change globalResourceControl: from true to false
  3. change ATM_petlist_bounds: to be 0 3023 - these numbers are the lower and upper bounds of the MPI ranks (0-based) used by the ATM model, in this case 24*16*6 + 2*360 = 3024, where 24 and 16 are the layout values from input.nml and 2*360 are the write component values from model_configure
  4. change ATM_omp_num_threads: from 4 to 1

And, I added job_card by copying one of the job_card from regression test run and changed:

  1. export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads
  2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads
  3. #SBATCH --nodes=152
    #SBATCH --ntasks-per-node=80

80 is the number of cores on Hercules compute nodes
152 is the minimal number of nodes such that 152*80 >= 3024*4 (MPI ranks times OMP threads)
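Putting those numbers together, a sketch of the resulting job_card (the #!/bin/bash header and any account/partition directives are assumed; everything else is taken from this comment):

    #!/bin/bash
    #SBATCH --nodes=152                # minimal node count: 152*80 cores >= 3024 ranks * 4 threads
    #SBATCH --ntasks-per-node=80       # 80 cores per Hercules compute node
    #SBATCH --time=00:30:00            # the 30-minute wall-clock limit used for the quick test

    # ufs.configure (per the list above): ATM only, globalResourceControl: false,
    # ATM_petlist_bounds: 0 3023 (24*16*6 + 2*360 = 3024 ranks), ATM_omp_num_threads: 1

    export OMP_NUM_THREADS=4           # traditional MPI+OpenMP threading, 4 threads per rank
    srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x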

@aerorahul
Contributor

I only changed ufs.configure:

  1. remove all components except ATM
  2. change globalResourceControl: from true to false
  3. change ATM_petlist_bounds: to be 0 3023 - these numbers are the lower and upper bounds of the MPI ranks used by the ATM model, in this case 24*16*6 + 2*360, where 24 and 16 are layout values from input.nml and 2*360 are write comp values from model_configure

And, I added job_card by copying one of the job_card from regression test run and changed:

  1. export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads
  2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads
  3. #SBATCH --nodes=152
    #SBATCH --ntasks-per-node=80

80 is the number of cores on Hercules compute nodes; 152 is the minimal number of nodes such that 152*80 >= 3024*4 (MPI ranks times OMP threads)

Ok. Yes. That makes sense for the atm-only.
Does your ufs.configure have a line for

ATM_omp_num_threads:            @[atm_omp_num_threads]

@[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false?

The original value for ATM_petlist_bounds must have been 0 755, which you changed to 0 3023, I am assuming.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

@DusanJovic-NOAA
Collaborator

I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4; I'm not sure if it's ignored when globalResourceControl is set to false.

The original value for ATM_petlist_bounds was something like 12 thousand; that included MPI ranks times 4 threads.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

@aerorahul
Contributor

@JessicaMeixner-NOAA
I think the global-workflow is coded to use the correct ufs_configure template and set the appropriate values for PETLIST_BOUNDS and OMP_NUM_THREADS in the ufs_configure file.
The default in the global-workflow is to use ESMF_THREADING = YES. I am pretty sure one could use traditional threading as well, but that is unconfirmed, as there was still work being done to confirm traditional threading will work on WCOSS2 with the Slingshot updates and whatnot. Details on that are fuzzy to me at the moment.

BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

@aerorahul
Contributor

Ok, @GeorgeVandenberghe-NOAA. Do we employ traditional threading at C768 and up? If so, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

@JessicaMeixner-NOAA
Collaborator

Unfortunately I was unable to replicate @jiandewang's Hercules success for the HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for Orion either.

@JessicaMeixner-NOAA
Collaborator

Unfortunately I was unable to replicate @jiandewang's Hercules success for the HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for Orion either.

Note this was with added waves - so this might have also failed for @jiandewang if he has used waves.

@jiandewang
Collaborator

A summary of more tests I did on HERCULES:
(1) S2S, FV3 layout=8x16, write tasks per group=10: runs fine; repeated 3 more cases, all fine
(2) same as (1) but layout=24x16: hangs
(3) repeat of (1) and (2) but S2SW: all hang

@JacobCarley-NOAA

@DeniseWorthen Thanks so much for your efforts. Please proceed to return to the grid imprint issue (#2466).

@JessicaMeixner-NOAA I think the ability to run with traditional threading (no managed threading) was added to GW earlier this year (see GW Issue 2277). However, I'm not sure if it's working. If it's not, I'd recommend proceeding with opening a new issue for this feature. Since something might already exist, hopefully it's not too much of a lift to get it going. This will hopefully get you working in the short-ish term.

Now, there's still something going on that we need to understand. @GeorgeVandenberghe-NOAA Would you be able to continue digging into this issue?

@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Nov 22, 2024

@JacobCarley-NOAA a comment from @aerorahul earlier in this thread:

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

I'll open a g-w issue (update: g-w issue: NOAA-EMC/global-workflow#3122)


@JacobCarley-NOAA

Thanks @GeorgeVandenberghe-NOAA! Just send me a quick note offline (email is fine) when you need a component expert to jump in and I'll be happy to coordinate accordingly.

@GeorgeVandenberghe-NOAA
Collaborator

It looks like the hangs are related to the total number of WAVE tasks but are also related to total resource usage.

I have verified that a 16x16 ATM decomposition with traditional threads (two per rank) and 1400 wave ranks does not hang on either Orion or Hercules, but a 24x32 decomposition with 1400 wave ranks does. 998-rank runs do get through with a 24x32 decomposition. So it looks like total job resources are a contributing factor; it isn't just a hard barrier that we can't run 1400 wave tasks on Orion or Hercules.
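For reference, the ATM compute-rank counts implied by those two layouts (6 cubed-sphere tiles per layout; write-component and other components not counted):

    # Quick back-of-the-envelope check of the ATM compute ranks in each case
    echo $(( 16*16*6 ))   # 1536 ATM ranks: runs with 1400 wave ranks on Orion/Hercules
    echo $(( 24*32*6 ))   # 4608 ATM ranks: hangs with 1400 wave ranks, passes with 998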

@aerorahul
Contributor

@RuiyuSun
I have implemented a traditional threading option in the global-workflow with suggestions from @junwang-noaa and @DusanJovic-NOAA. global-workflow PR 3149 is under review.
I have tested the case of C768 S2SW on Hercules. Please see the details and changes in the open PR.


@theurich
Collaborator

I am curious, has the HRv4 configuration (with ESMF-managed threading) been run on Gaea and/or WCOSS2? If so, does it also hang in the same way? Sorry if this was discussed already and I overlooked it in the discussion.


@JessicaMeixner-NOAA
Collaborator

I am curious, has the HRv4 configuration (with ESMF-managed threading) been run on Gaea and/or WCOSS2? If so, does it also hang in the same way? Sorry if this was discussed already and I overlooked it in the discussion.

@theurich we ran HR4 on WCOSS2. I can't remember if we had trouble finding a node combination that worked, but all of HR4 ran reliably, although @jiandewang can correct me if I'm wrong. I don't think HR4 was run on Gaea, but George might have run tests there.

@DeniseWorthen
Collaborator

@theurich I don't want to muddy the water, but we have been seeing issues w/ ESMF-MT on Gaea-C6 for just the RTs. See #2448 (comment) and #2448 (comment)

@theurich
Collaborator

No. It is much more reliable on WCOSS2 and Gaea; Orion and Hercules are the two systems where we see these hangs. It would likely happen on Hera too, but Hera is too small and busy to even try this there. The constraint on ESMF managed threading on Gaea and WCOSS2 is the inability to spawn more than about 21000 MPI ranks, so we have to go to traditional threading for high resolution and fast runtimes to stay under 21000 MPI ranks.

That is good to know. So an interesting twist here is that we think that ESMF v8.8.0b09 addresses the issue with the higher MPI task count on GAEA and WCOSS2. In fact it was sort of our hope of pushing this beta out for UFS... but then we learned about this new issue on Hercules and Orion with HRv4 and ESMF-MT. ... which we are working on now, but seems like a separate issue.

Bottom line, if someone could try a larger (> 21k PET) job on GAEA or WCOSS2 with ESMF v8.8.0b09, that would be interesting. The recommendation with this beta is to NOT switch to UCX, but to use the default OFI on Cray Slingshot.

@theurich
Collaborator

@theurich I don't want to muddy the water, but we have been seeing issues w/ ESMF-MT on Gaea-C6 for just the RTs. See #2448 (comment) and #2448 (comment)

@DeniseWorthen interesting... I will read through those issues. Thanks!


@jiandewang
Collaborator

I am curious, has the HRv4 configuration (with ESMF-managed threading) been run on Gaea and/or WCOSS2? If so, does it also hang in the same way? Sorry if this was discussed already and I overlooked it in the discussion.

@theurich we ran HR4 on WCOSS2. I can't remember if we had trouble finding a node combination that worked, but all of HR4 ran reliably, although @jiandewang can correct me if I'm wrong. I don't think HR4 was run on Gaea, but George might have run tests there.

We ran hundreds of HR4 cases on WCOSS2 and didn't have any hanging cases, and we adjusted FV3 resources for speed and job turnaround purposes (not for the purpose of node combination) without issue.


@DusanJovic-NOAA
Collaborator

Gerhard (@theurich) suggests we set the variable FI_MLX_INJECT_LIMIT=0 in the job scripts on Hercules (and probably Orion). I tried the c1152s2sw test on Hercules, using ESMF managed threading, and it works fine.
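For anyone else trying this, a minimal sketch of where the workaround goes (the export itself is as suggested above; the launch line is only illustrative):

    # Set the mlx provider's inject limit to 0, as suggested by @theurich, before launching the model
    export FI_MLX_INJECT_LIMIT=0
    srun ./ufs_model.x   # placeholder; use the workflow's actual srun/launcher command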


@JessicaMeixner-NOAA
Collaborator

Gerhard (@theurich) suggests we set the variable FI_MLX_INJECT_LIMIT=0 in the job scripts on Hercules (and probably Orion). I tried the c1152s2sw test on Hercules, using ESMF managed threading, and it works fine.

This also seems to help me; I'll post log files later (it's still pretty slow, so I am running a shorter forecast so the full run finishes).

@theurich
Collaborator

Gerhard (@theurich) suggests we set the variable FI_MLX_INJECT_LIMIT=0 in the job scripts on Hercules (and probably Orion). I tried the c1152s2sw test on Hercules, using ESMF managed threading, and it works fine.

This also seems to help me; I'll post log files later (it's still pretty slow, so I am running a shorter forecast so the full run finishes).

@JessicaMeixner-NOAA do you have ESMF profiling enabled for these runs? If so, I would be interested in looking at the profile summary to see if anything obvious sticks out wrt performance.

@JessicaMeixner-NOAA
Collaborator

@theurich I do - that's actually why I shortened the forecast length, so we could get the report and not go over the wallclock. I'll post the location here when complete.
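For reference, ESMF run-time profiling is normally enabled through standard ESMF environment variables (not specific to this workflow); something like:

    # Turn on ESMF profiling and write the aggregated ESMF_Profile.summary at the end of the run
    export ESMF_RUNTIME_PROFILE=ON
    export ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY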


@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Dec 12, 2024

@theurich In this configuration the wave component is slow - I'll try increasing wave nodes and using PIO. Note both are using a branch of WW3 as I was trying to test initialization speedup (which we do see in the second):

/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t03/COMROOT/t03/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary

/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t04/COMROOT/t04/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary

Edit: Wave is slower than [fv3_fcst] RunPhase1 , but not [ATM] RunPhase1

@DeniseWorthen
Collaborator

I agree that using PIO for WW3 might help. See the memory figures I posted here: the middle panels (VmRSS) and the drop in memory required when not loading everything onto the last PET for binary restarts.

That said, I can't explain why the memory sizes are << node memory, regardless. It seems like there should be plenty of memory, even w/o PIO+WW3.

@theurich
Collaborator

@theurich In this configuration the wave component is slow - I'll try increasing wave nodes and using PIO. Note both are using a branch of WW3 as I was trying to test initialization speedup (which we do see in the second):

/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t03/COMROOT/t03/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary

/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t04/COMROOT/t04/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary

Edit: Wave is slower than [fv3_fcst] RunPhase1 , but not [ATM] RunPhase1

@JessicaMeixner-NOAA Just to let you know that I don't have an account on Hercules. Will need a copy of the ESMF_Profile.summary files on Hera or Gaea to look at them. Thanks.

@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Dec 16, 2024

@theurich In this configuration the wave component is slow - I'll try increasing wave nodes and using PIO. Note both are using a branch of WW3 as I was trying to test initialization speedup (which we do see in the second):
/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t03/COMROOT/t03/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary
/work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t04/COMROOT/t04/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary
Edit: Wave is slower than [fv3_fcst] RunPhase1 , but not [ATM] RunPhase1

@JessicaMeixner-NOAA Just to let you know that I don't have an account on Hercules. Will need a copy of the ESMF_Profile.summary files on Hera or Gaea to look at them. Thanks.

@theurich They are on Hera here:
/scratch1/NCEPDEV/climate/Jessica.Meixner/ForGerhard
