Adjust C768 resources for Hera #2819

DavidHuber-NOAA · 2024-08-09T17:44:01Z

Description

This PR modifies the resources for gdasfcst (on all machines) and enkfgdaseupd (Hera only). For the fcst job, the number of write tasks is increased to prevent out-of-memory errors from the inline post. For the eupd job, the number of tasks is decreased, also to prevent out-of-memory errors. The runtime for the eupd job was just over 10 minutes.

Resolves #2506
Resolves #2498
Resolves #2916

Type of change

Bug fix (fixes something broken)

Change characteristics

Is this a breaking change (a change in existing functionality)? YES/NO
Does this change require a documentation update? YES/NO
Does this change require an update to any of the following submodules? YES (If YES, please add a link to any PRs that are pending.)
- GSI-utils: "Send/receive layers to reduce buffer transfer time" (GSI-utils#49) and "Bugfix for where some fields are not in the input increment file" (GSI-utils#53)

How has this been tested?

Successfully ran through the mentioned jobs at least once each. More testing to come.

Checklist

Any dependent changes have been merged and published
My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
New and existing tests pass with my changes

* origin/develop:
  Stage atmospheric backgrounds and UFS cubed-sphere history files (NOAA-EMC#2792)
  Check that a PR driver is still running before trying to kill it (NOAA-EMC#2799)
  Feature/get arch adds an empty archive job to GEFS system (NOAA-EMC#2772)
  Marine DA updates (NOAA-EMC#2802)
  Revert MSU FIX_DIRs back to glopara (NOAA-EMC#2811)
  Bugfix for updating label states in Jenkins (NOAA-EMC#2808)
  Clean-up temporary rundirs - take 2. (NOAA-EMC#2753)
  Change land surface for HR4 (NOAA-EMC#2787)
  Run METplus serially and correct the name of prod tasks (NOAA-EMC#2804)
  Update Java Agent launching script for Jenkins connections (NOAA-EMC#2762)
  Fix erroneous cdump addition (NOAA-EMC#2803)
  Update ocean post-processing triggers (NOAA-EMC#2784)
  Update the gfs_utils repository hash (NOAA-EMC#2801)
  Add fixes for metplus jobs when gfs_cyc=2 or 4 (NOAA-EMC#2791)
  Simplify resource-related variables, remove CDUMP where unneeded (NOAA-EMC#2727)
  Remove f000 from atmos rocoto tasks for replay cases (NOAA-EMC#2778)

spanNOAA · 2024-08-15T03:57:52Z

It seems that this pull request does not resolve the issue with eupd. For further details, you may review the log files located at:
/scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/stmp/RUNDIRS/C768_6hourly_0210/eupd.1499050/stderr
/scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/stmp/RUNDIRS/C768_6hourly_0210/eupd.2575750/stderr
These logs should provide more insight into the problem.

DavidHuber-NOAA · 2024-08-15T15:14:14Z

@spanNOAA You are correct. Somewhere along the way, I dropped the working configuration by accident. It should be running two tasks/node (i.e. threads_per_task=20).
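
The relationship between threads per task and tasks per node can be sketched as below. This is only an illustration of the arithmetic, not the workflow's actual resource code: the function names are hypothetical, the 40-core node size is an assumption inferred from "two tasks/node" at 20 threads per task, and the task count is an illustrative value.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the tasks-per-node arithmetic.

# Tasks that fit on one node when each task reserves threads_per_task cores.
tasks_per_node() {
  local cores_per_node=$1 threads_per_task=$2
  echo $(( cores_per_node / threads_per_task ))
}

# Nodes needed to place ntasks at per_node tasks per node, rounded up.
nodes_needed() {
  local ntasks=$1 per_node=$2
  echo $(( (ntasks + per_node - 1) / per_node ))
}

cores_per_node=40     # assumption: implied by two tasks/node at 20 threads
threads_per_task=20   # value from the comment above
ntasks=80             # illustrative task count, not taken from the PR diff

per_node=$(tasks_per_node "${cores_per_node}" "${threads_per_task}")
echo "tasks per node: ${per_node}"
echo "nodes needed:   $(nodes_needed "${ntasks}" "${per_node}")"
```

Fewer tasks per node means each task has a larger share of the node's memory, which is the point of raising the thread count here.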

I am working presently to see if I can spread the tasks out differently so more tasks can run on each node. The issue is that the first 4 tasks (the I/O tasks) must store an enormous amount of data in memory (about 40GB each). The remaining tasks are not as memory intensive. One way to solve this is to develop an arbitrary distribution scheme rather than the default block distribution. Something like

srun -N <n nodes> ... --distribution=arbitrary -w tux[0,1,2,3,4,4,4,4,4,5,5,5,5,5,...,n-1, n-1, n-1, n-1, n-1]

But this is not trivial to program. I will likely go with threads_per_task=20 for the time being, but want to run one more test first and then I will update the PR.
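
For illustration only, the node ordering described above could be generated with something like the following sketch. The function name, its arguments, and the `tux` node prefix are hypothetical; it only builds the per-task node list and does not show how that list would be handed to srun.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one node name per task, in task order.
# Args: <io_tasks> <total_tasks> <compute_tasks_per_node> [node_prefix]
build_task_layout() {
  local io_tasks=$1 total_tasks=$2 per_node=$3 prefix=${4:-tux}
  local task node
  for (( task = 0; task < total_tasks; task++ )); do
    if (( task < io_tasks )); then
      # Each memory-heavy I/O task gets a node to itself.
      node=${task}
    else
      # Remaining tasks are packed per_node to a node, after the I/O nodes.
      node=$(( io_tasks + (task - io_tasks) / per_node ))
    fi
    printf '%s%d\n' "${prefix}" "${node}"
  done
}

# Example: 4 I/O tasks plus 10 other tasks, 5 of the latter per node.
build_task_layout 4 14 5 | paste -sd, -
```

With these example arguments the list puts tasks 0 through 3 on their own nodes and then five tasks on each of the next two nodes, matching the pattern in the srun line above.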

spanNOAA · 2024-08-15T16:33:01Z

> @spanNOAA You are correct. Somewhere along the way, I dropped the working configuration by accident. It should be running two tasks/node (i.e. threads_per_task=20).
>
> I am working presently to see if I can spread the tasks out differently so more tasks can run on each node. The issue is that the first 4 tasks (the I/O tasks) must store an enormous amount of data in memory (about 40GB each). The remaining tasks are not as memory intensive. One way to solve this is to develop an arbitrary distribution scheme rather than the default block distribution. Something like
>
> srun -N <n nodes> ... --distribution=arbitrary -w tux[0,1,2,3,4,4,4,4,4,5,5,5,5,5,...,n-1, n-1, n-1, n-1, n-1]
>
> But this is not trivial to program. I will likely go with threads_per_task=20 for the time being, but want to run one more test first and then I will update the PR.

@DavidHuber-NOAA Thanks for the update.

guoqing-noaa · 2024-08-15T16:52:53Z

Thanks, @DavidHuber-NOAA!

@spanNOAA Does threads_per_task=20 work for your case?

spanNOAA · 2024-08-19T20:13:29Z

@DavidHuber-NOAA ntasks=80 works now.

guoqing-noaa

This PR solved our C768 issues on Hera.

Thanks, @DavidHuber-NOAA

DavidHuber-NOAA · 2024-08-20T12:08:08Z

@WalterKolczynski-NOAA @aerorahul @RussTreadon-NOAA Note that this PR (and the dependent gsi-utils PR NOAA-EMC/GSI-utils#49) has changes that will affect the C768 forecast on all machines and the gdasanalcalc job at all resolutions. I have not performed any testing to verify that values did not change. I should be able to do that on WCOSS2 next week. Opening for review in the meantime, but please do not merge until I have done that testing.

RussTreadon-NOAA · 2024-08-20T13:54:49Z

Thank you @DavidHuber-NOAA for running tests on WCOSS2. Hopefully the changes to gdasanalcalc via GSI-utils PR #49 do not alter results.

RussTreadon-NOAA · 2024-08-20T14:03:56Z

@CatherineThomas-NOAA , tagging you for awareness. Not sure if / how these changes might impact plans for using Hera to run GFS v17 parallels.

emcbot · 2024-09-19T16:24:37Z

CI Failed on Orion in Build# 3
Built and ran in directory /work2/noaa/stmp/CI/ORION/2819


Experiment C48_S2SWA_gefs_1921288e Terminated with  tasks failed and  dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C48_S2SWA_gefs_1921288e Terminated: **
Experiment C96_atm3DVar_1921288e Terminated with  tasks failed and  dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C96_atm3DVar_1921288e Terminated: **
Experiment C48_S2SW_1921288e Terminated with  tasks failed and  dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C48_S2SW_1921288e Terminated: **
Experiment C96C48_hybatmDA_1921288e Terminated with  tasks failed and  dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C96C48_hybatmDA_1921288e Terminated: **

WalterKolczynski-NOAA · 2024-09-19T16:42:21Z

The tracker is explicitly disabled on Hercules in the J-Jobs:

global-workflow/jobs/JGFS_ATMOS_CYCLONE_GENESIS

Line 9 in 7588d2b

if [[ "${machine}" == 'HERCULES' ]]; then exit 0; fi

I'll remove this and place it in config.base instead for both Hercules and Orion.

We really should be setting this in the hosts file.
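
As a rough sketch of what moving the check out of the J-Job might look like, the machine test could become a setting in a config file. The function name and the `DO_TRACKER` variable name here are assumptions for illustration, not confirmed names from config.base or the hosts files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the tracker should run on a machine.
tracker_enabled_for() {
  case "$1" in
    HERCULES | ORION) echo "NO" ;;   # tracker disabled on these machines
    *) echo "YES" ;;
  esac
}

machine=${machine:-HERA}                         # example default only
DO_TRACKER=$(tracker_enabled_for "${machine}")   # assumed variable name
export DO_TRACKER
```

The J-Job would then no longer need its own machine check, since the decision is made once where the configuration is set.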

DavidHuber-NOAA · 2024-09-19T16:49:50Z

@TerrenceMcGuinness-NOAA Can you tell what happened with the Orion CI tests? It almost looks like the EXPDIR directories were deleted mid-run.

TerrenceMcGuinness-NOAA · 2024-09-19T17:07:40Z

@DavidHuber-NOAA
> Can you tell what happened with the Orion CI tests? It almost looks like the EXPDIR directories were deleted mid-run.

Yes it did ...

The Jenkins custom workspace is designated per the PR name. When a PR CI is re-run after a previous failure, it uses the same path and removes COMROOT and EXPDIR under its RUNDIRS directory. I couldn't tell from the logs what specifically happened with the mislabeling. My hunch is that the job was actually hung in the controller.

DavidHuber-NOAA · 2024-09-19T17:20:17Z

> The tracker is explicitly disabled on Hercules in the J-Jobs:
>
> global-workflow/jobs/JGFS_ATMOS_CYCLONE_GENESIS
>
> Line 9 in 7588d2b
>
> if [[ "${machine}" == 'HERCULES' ]]; then exit 0; fi
>
> I'll remove this and place it in config.base instead for both Hercules and Orion.
>
> We really should be setting this in the hosts file.

This will require a change to all of the host files and the CI defaults (which override whatever is in the host files) and will need to be tested on each platform. I would prefer to open a new issue to tackle this as this PR is already tackling a few different issues and going beyond its scope.

DavidHuber-NOAA · 2024-09-19T17:23:52Z

Opened #2942

emcbot · 2024-09-19T19:57:10Z

CI Passed on Hercules in Build# 4
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2819


Experiment C48_ATM_62ea3892 Completed 1 Cycles: *SUCCESS* at Thu Sep 19 13:24:53 CDT 2024
Experiment C48_S2SW_62ea3892 Completed 1 Cycles: *SUCCESS* at Thu Sep 19 14:31:32 CDT 2024
Experiment C96_atm3DVar_62ea3892 Completed 3 Cycles: *SUCCESS* at Thu Sep 19 14:43:42 CDT 2024
Experiment C96C48_hybatmDA_62ea3892 Completed 3 Cycles: *SUCCESS* at Thu Sep 19 14:56:08 CDT 2024
Experiment C48_S2SWA_gefs_62ea3892 Completed 1 Cycles: *SUCCESS* at Thu Sep 19 14:56:59 CDT 2024

DavidHuber-NOAA · 2024-09-20T11:12:21Z

Looking through Orion's logs, I see that all experiments completed successfully. However, for some reason the C96_atm3DVar and C48_ATM tests, once completed, did not trigger SUCCESS notifications. Manually changing label to CI-Orion-Passed.

emcbot · 2024-09-20T14:09:49Z

Experiment C48_S2SWA_gefs FAILED on Orion in Build# 5 in
/work2/noaa/stmp/CI/ORION/2819/RUNTESTS/EXPDIR/C48_S2SWA_gefs_62ea3892

emcbot · 2024-09-20T14:10:02Z

CI Failed on Orion in Build# 5
Built and ran in directory /work2/noaa/stmp/CI/ORION/2819


CI Failed on Orion in Build# 3
Built and ran in directory /work2/noaa/stmp/CI/ORION/2819

Experiment C48_S2SWA_gefs_1921288e Terminated with tasks failed and dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C48_S2SWA_gefs_1921288e Terminated: **
Experiment C96_atm3DVar_1921288e Terminated with tasks failed and dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C96_atm3DVar_1921288e Terminated: **
Experiment C48_S2SW_1921288e Terminated with tasks failed and dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C48_S2SW_1921288e Terminated: **
Experiment C96C48_hybatmDA_1921288e Terminated with tasks failed and dead at Thu Sep 19 11:12:47 AM CDT 2024
Experiment C96C48_hybatmDA_1921288e Terminated: **
Experiment C48_ATM_62ea3892 Completed 1 Cycles: SUCCESS at Thu Sep 19 07:33:08 PM CDT 2024
Experiment C96_atm3DVar_62ea3892 Completed 3 Cycles: SUCCESS at Fri Sep 20 05:59:17 AM CDT 2024
Experiment C96C48_hybatmDA_62ea3892 Completed 3 Cycles: SUCCESS at Fri Sep 20 06:48:06 AM CDT 2024
Experiment C48_S2SW_62ea3892 Completed 1 Cycles: SUCCESS at Fri Sep 20 07:06:25 AM CDT 2024
Experiment C48_S2SWA_gefs_62ea3892 Terminated with tasks failed and dead at Fri Sep 20 09:09:42 AM CDT 2024
Experiment C48_S2SWA_gefs_62ea3892 Terminated: **

DavidHuber-NOAA added 4 commits August 9, 2024 15:19

Add level-by-level fix for netcdf_io from gsi_utils

615350a

Add more write groups to spread memory requirements

9a024e3

Reduce eupd tasks/threads on Hera

fbd0660

DavidHuber-NOAA requested a review from guoqing-noaa August 9, 2024 17:44

DavidHuber-NOAA mentioned this pull request Aug 9, 2024

C768 analysis tasks Fail on Hera #2498

Closed

Renamed custom eupd resources

49a4cd3

Correct resources for eupd, upp jobs @c768

f9fcc89

guoqing-noaa approved these changes Aug 19, 2024

Merge branch 'develop' into fix/c768_hera

a961da1

DavidHuber-NOAA marked this pull request as ready for review August 20, 2024 12:00

DavidHuber-NOAA requested review from aerorahul, WalterKolczynski-NOAA and RussTreadon-NOAA August 20, 2024 12:00

DavidHuber-NOAA added the blocked Issue is currently being blocked by another issue label Aug 20, 2024

WalterKolczynski-NOAA requested changes Aug 22, 2024

DavidHuber-NOAA and others added 3 commits August 22, 2024 12:08

Update parm/config/gfs/config.resources.HERA

10f7c9b

Co-authored-by: Walter Kolczynski - NOAA <Walter.Kolczynski@noaa.gov>

Merge remote-tracking branch 'emc/develop' into fix/c768_hera

0491fca

Merge remote-tracking branch 'emc/develop' into fix/c768_hera

80c5b55

DavidHuber-NOAA mentioned this pull request Sep 5, 2024

C768 gdasfcst runs too slow on WCOSS2 #2891

Closed

DavidHuber-NOAA and others added 2 commits September 6, 2024 14:30

Update gsi_utils hash

6ba5131

Merge branch 'develop' into fix/c768_hera

7b0619e

emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Sep 19, 2024

TerrenceMcGuinness-NOAA added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Sep 19, 2024

emcbot added CI-Hercules-Running **Bot use only** CI testing on Hercules for this PR is in-progress and removed CI-Hercules-Building **Bot use only** CI testing is cloning/building on Hercules labels Sep 19, 2024

Merge branch 'develop' into fix/c768_hera

84f3c42

DavidHuber-NOAA added CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Sep 20, 2024

DavidHuber-NOAA merged commit 3c86873 into NOAA-EMC:develop Sep 20, 2024
5 checks passed

DavidHuber-NOAA deleted the fix/c768_hera branch September 20, 2024 11:17

emcbot added the CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed label Sep 20, 2024

emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Sep 20, 2024
