Preparation for Transition to C6 #78

Merged
merged 28 commits into from
Aug 19, 2024
@yichengt900 yichengt900 commented Aug 13, 2024

This PR addresses issue #79. I have conducted several tests on C6, including a 30-year long-term physics-only nudging run for NWA12 (see figures below). Most of those runs completed successfully, although I did experience a few failures in the model's historical data transfer, which required manual resubmission of the output.stage jobs. I have transferred some model input data for NWA12, NWA25, NEP10 (NEP_input and NEP_era5), and ARC12_pub to /gpfs/f6/ira-cefi/world-shared. Unfortunately, F6 cannot access F5 directly, so some of you may need to transfer specific forcing data (e.g., JRA forcings) from F5 to F6.

Gulf Stream position:
gulfstream_eval

Cold water index:
coldpool_eval

Sea ice extent in the Gulf of St. Lawrence:
gsl_extent

While we can now run the FRE workflow on C6, a few issues remain:

  • The default FRE/test could not properly handle XML copying from Gaea to GFDL. Additionally, the C6 system lacks the lfs command, which is used to find and list file names. I initially created a custom version of FRE/test; the FRE group has since fixed the bugs, so when you log in to C6, please load fre/test with the following commands:
module use /ncrc/home2/fre/local/modulefiles
module load fre/test
  • You will need to update your platforms.xml to include ncrc6.intel23, as the default Intel compiler on C6 is now version 2023.2.0. FRE/test initially did not work properly on GFDL PPAN during the PP step, and I had recommended continuing to use FRE/bronx-21 or FRE/bronx-22 on PPAN for the PP process (see the platforms.xml). The FRE group has since fixed FRE/test for PPAN, so we can now use FRE/test for PP on PPAN.

  • For some reason, I encountered a crash when running NWA12 using the FMS1-intel23 build with mask_table on C6. The FMS2-intel23 build did not have this issue, so I recommend using FMS2 for now. We've resolved the nudging performance issue for FMS2, and it’s being used for other domains, so this shouldn’t be a significant problem.

  • Lastly, I've noticed some unusual behavior on F6, as well as with data transfers between PPAN and F6 (I have opened several helpdesk tickets). Be aware that you might encounter some odd issues when working on F6.
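For anyone whose own scripts call lfs to list files, the missing command on C6 can be worked around with a small guard. The following is only a minimal sketch, not part of FRE: the helper name `list_files` is hypothetical, and it assumes plain `find` is an acceptable substitute wherever `lfs` is not on the PATH.

```shell
#!/usr/bin/env bash
# list_files DIR: print the regular files under DIR, one per line.
# Hypothetical helper (not part of FRE). It uses "lfs find" when the
# Lustre client tool is available and falls back to plain "find" otherwise.
list_files() {
    local dir="$1"
    if command -v lfs >/dev/null 2>&1; then
        lfs find "$dir" --type f
    else
        find "$dir" -type f
    fi
}
```

Calling `list_files /path/to/history` then behaves the same way on systems with and without lfs, where the path shown is a placeholder.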

08/13/2024

It looks like regression testing is now working on C6, but FRE is still not functioning properly.

08/14/2024

Although I haven't heard from MSD (received an email at 5:00 PM, which suggests that gcp is now working), I'm making some progress. The fremake and frerun processes are partially working with a few tweaks (the lfs command is missing on C6; ncrc6.inc was missing in hsmget/test but is now fixed; I had to maintain our own FRE environment to make it work).

Additionally, for some reason, the Intel23 FMS1 build didn't work with mask_table, so let's stick with FMS2 for now.

08/15/2024

A one-year NWA test run using FRE on C6 seems to be going well. However, the XML file isn't copying correctly to PPAN during the data transfer step, so PP won't work automatically. I’ll take a look and try to fix it when PPAN is back online.

It appears that the issue may be due to the patternSedF5 environment variable being set incorrectly in the fre/test. I have a temporary solution, but I will need to rerun a test to confirm. Additionally, data transfer is currently quite slow.
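One quick way to inspect a suspect setting such as patternSedF5 after loading the module is to print whether it is set and what it contains. This is a generic sketch: the helper name `report_var` is hypothetical, and it assumes the variable is exported into the environment; it does not say what the correct value should be.

```shell
#!/usr/bin/env bash
# report_var NAME: report whether environment variable NAME is set,
# and print its value if so. Hypothetical helper, not part of FRE.
report_var() {
    local name="$1"
    if printenv "$name" >/dev/null; then
        printf '%s=%s\n' "$name" "$(printenv "$name")"
    else
        printf '%s is unset\n' "$name"
    fi
}
```

For example, `report_var patternSedF5` can be run before and after `module load fre/test` to compare the two values.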

The temporary solution seems to work, but the gcp from fre/test did not work on PPAN. The interim workaround would be to use FRE/bronx-22 on PPAN for PP. MSD fixed this problem after I reported it. I will rerun the whole test to make sure it works.

The 1-year run completed successfully with PP. I will conduct another 5-year run to make sure everything is good and we are ready to go!

08/16/2024

The 5-year run is still in progress. The model simulation appears to be functioning correctly, but I encountered output.stager job failures starting in the second year, along with some unusual filesystem behavior. I believe this PR is ready, but the C6/F6 system may still need some tuning.

CC @charliestock to keep you in the loop.

@yichengt900 yichengt900 self-assigned this Aug 13, 2024
@andrew-c-ross andrew-c-ross left a comment

Everything here seems to work for me. Thanks Yi-Cheng! The only thing I haven't tested out yet is post-processing.

Side note, that is the best Gulf Stream and cold pool simulation I've ever seen with this model!

@charliestock

Hi Yi-Cheng - thanks for all of your hard work on this. Amazing progress, and I have to agree with Andrew. That Gulf Stream looks amazing! Is that result deterministic or just a lucky spin?

@yichengt900

@charliestock I would say it's half-half. The Glorys data nudging likely played a role, but as I recall, we didn't achieve such good results even with nudging in the past?

@yichengt900 yichengt900 merged commit 214d998 into main Aug 19, 2024
3 checks passed
@yichengt900 yichengt900 deleted the feature/c6 branch August 19, 2024 20:59
3 participants