Adjust C768 resources for Hera #2819
* origin/develop:
  Stage atmospheric backgrounds and UFS cubed-sphere history files (NOAA-EMC#2792)
  Check that a PR driver is still running before trying to kill it (NOAA-EMC#2799)
  Feature/get arch adds an empty archive job to GEFS system (NOAA-EMC#2772)
  Marine DA updates (NOAA-EMC#2802)
  Revert MSU FIX_DIRs back to glopara (NOAA-EMC#2811)
  Bugfix for updating label states in Jenkins (NOAA-EMC#2808)
  Clean-up temporary rundirs - take 2. (NOAA-EMC#2753)
  Change land surface for HR4 (NOAA-EMC#2787)
  Run METplus serially and correct the name of prod tasks (NOAA-EMC#2804)
  Update Java Agent launching script for Jenkins connections (NOAA-EMC#2762)
  Fix erroneous cdump addition (NOAA-EMC#2803)
  Update ocean post-processing triggers (NOAA-EMC#2784)
  Update the gfs_utils repository hash (NOAA-EMC#2801)
  Add fixes for metplus jobs when gfs_cyc=2 or 4 (NOAA-EMC#2791)
  Simplify resource-related variables, remove CDUMP where unneeded (NOAA-EMC#2727)
  Remove f000 from atmos rocoto tasks for replay cases (NOAA-EMC#2778)
It seems that this pull request does not resolve the issue with eupd. For further details, you may review the log files located at: |
@spanNOAA You are correct. Somewhere along the way, I dropped the working configuration by accident. It should be running two tasks/node (i.e. I am working presently to see if I can spread the tasks out differently so more tasks can run on each node. The issue is that the first 4 tasks (the I/O tasks) must store an enormous amount of data in memory (about 40GB each). The remaining tasks are not as memory intensive. One way to solve this is to develop an
But this is not trivial to program. I will likely go with |
@DavidHuber-NOAA Thanks for the update.
Thanks, @DavidHuber-NOAA! @spanNOAA Does |
@DavidHuber-NOAA ntasks=80 works now.
This PR solved our C768 issues on Hera.
Thanks, @DavidHuber-NOAA
@WalterKolczynski-NOAA @aerorahul @RussTreadon-NOAA Note that this PR (and the dependent gsi-utils PR NOAA-EMC/GSI-utils#49) has changes that will affect the C768 forecast on all machines and the |
Thank you @DavidHuber-NOAA for running tests on WCOSS2. Hopefully the changes to |
@CatherineThomas-NOAA, tagging you for awareness. Not sure if / how these changes might impact plans for using Hera to run GFS v17 parallels.
Co-authored-by: Walter Kolczynski - NOAA <Walter.Kolczynski@noaa.gov>
CI Failed on Orion in Build# 3
We really should be setting this in the hosts file.
@TerrenceMcGuinness-NOAA Can you tell what happened with the Orion CI tests? It almost looks like the EXPDIR directories were deleted mid-run.
@DavidHuber-NOAA Yes, it did. The Jenkins custom workspace is designated per the PR name. When a PR's CI is rerun after a previous failure, it uses the same path and removes COMROOT and EXPDIR under its RUNDIRS directory. I couldn't tell from the logs what specifically happened with the mislabeling. My hunch is that the job was actually hung in the controller.
This will require a change to all of the host files and the CI defaults (which override whatever is in the host files) and will need to be tested on each platform. I would prefer to open a new issue for this, since this PR already tackles a few different issues and is going beyond its scope.
Opened #2942
CI Passed on Hercules in Build# 4
Looking through Orion's logs, I see that all experiments completed successfully. However, for some reason the |
Experiment C48_S2SWA_gefs FAILED on Orion in Build# 5 in |
CI Failed on Orion in Build# 5
Experiment C48_S2SWA_gefs_1921288e Terminated with tasks failed and dead at Thu Sep 19 11:12:47 AM CDT 2024
Description
This modifies the resources for gdasfcst (everywhere) and enkfgdaseupd (Hera only). For the fcst job, the number of write tasks is increased to prevent out-of-memory errors from the inline post. For the eupd job, the number of tasks is decreased to prevent out-of-memory errors. The runtime for the eupd job was just over 10 minutes.
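The memory reasoning behind the eupd change can be sketched with a little arithmetic. This is only an illustration, not code from the workflow: the helper names `max_tasks_per_node` and `nodes_needed` are hypothetical, and the 96 GB node memory is an assumed placeholder rather than a figure stated in this PR. The roughly 40 GB per I/O task and `ntasks=80` come from the discussion above.

```python
import math


def max_tasks_per_node(node_mem_gb: float, task_mem_gb: float) -> int:
    """Largest number of equally sized tasks that fit in one node's memory."""
    if task_mem_gb <= 0:
        raise ValueError("task_mem_gb must be positive")
    return int(node_mem_gb // task_mem_gb)


def nodes_needed(ntasks: int, tasks_per_node: int) -> int:
    """Number of nodes required to place ntasks at a fixed tasks-per-node."""
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    return math.ceil(ntasks / tasks_per_node)


# ASSUMPTION: usable memory per node; a placeholder, not taken from this PR.
NODE_MEM_GB = 96.0
# From the discussion: each I/O task holds about 40 GB in memory.
IO_TASK_MEM_GB = 40.0
# From the discussion: ntasks=80 was reported to work.
NTASKS = 80

# Size every node for the worst case (an I/O task), since all nodes
# get the same tasks-per-node setting in this simple layout.
tasks_per_node = max_tasks_per_node(NODE_MEM_GB, IO_TASK_MEM_GB)
print(tasks_per_node, nodes_needed(NTASKS, tasks_per_node))
```

With these assumed inputs the sketch yields 2 tasks per node and 40 nodes. Sizing every node for the most memory-hungry tasks is simple but wasteful, which is why the thread considers spreading the tasks out differently.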
Resolves #2506
Resolves #2498
Resolves #2916
Type of change
Change characteristics
How has this been tested?
Successfully ran through the mentioned jobs at least once each. More testing to come.
Checklist