GSI built with debug mode failed in the test global_4denvar on wcoss2 #712
WCOSS2 test
The following has been done on Cactus:
global_4denvar_loproc_updat and global_4denvar_hiproc_updat ran to completion in debug mode. Neither job hangs. The loproc job took 2835.414645 seconds to complete; the hiproc job took 1534.105594 seconds. Interestingly (and disturbingly), the initial gradients of the loproc and hiproc jobs differ in the 10th printed digit, while the initial total penalties are identical for all 19 printed digits.
loproc
hiproc
Differences in WCOSS2 results were observed in PR #616 and #692. Refactoring code yielded reproducible results with respect to the control. Now the loproc and hiproc debug runs demonstrate a lack of reproducibility. The WCOSS2 build uses hpc-stack with an older version of the Intel compiler. The GSI builds on other platforms use spack-stack modules and newer Intel compilers. Are we dealing with a compiler or module issue on WCOSS2? Would repeating the above test on other platforms yield non-reproducible loproc and hiproc results?
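For illustration only: one well-known way that runs on different processor counts can differ in low-order digits is that floating-point addition is not associative, so partial sums combined in a different order need not agree. The sketch below is generic Python, not GSI code, and it does not establish that this is the cause of the differences seen here. `naive_sum` and `partitioned_sum` are hypothetical helper names, and the input values are contrived to make the effect obvious.

```python
import math


def naive_sum(values):
    """Accumulate left to right in double precision, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, nparts):
    """Split values into nparts contiguous chunks (stand-ins for tasks),
    sum each chunk, then combine the partial sums in chunk order."""
    n = len(values)
    bounds = [n * i // nparts for i in range(nparts + 1)]
    partials = [naive_sum(values[bounds[i]:bounds[i + 1]]) for i in range(nparts)]
    return naive_sum(partials)


# Contrived data (an assumption for the demo): large terms that cancel, plus small ones.
values = [1.0e16, 1.0, -1.0e16, 1.0]

one_task = partitioned_sum(values, 1)   # single left-to-right accumulation
two_tasks = partitioned_sum(values, 2)  # two partial sums, then combined
exact = math.fsum(values)               # correctly rounded reference sum

print(one_task, two_tasks, exact)
```

Here the one-task sum is 1.0, the two-task sum is 0.0, and the correctly rounded result is 2.0. Real differences between loproc and hiproc runs would be far smaller, but the mechanism is the same.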
@RussTreadon-NOAA Thanks. I will dig further into some questions that remain for me and come back with an update. The failure of global_4denvar due to non-reproducible results (updat vs contrl) is reported in #679 (comment).
@xincjin-NOAA's PR #692 yields reproducible global_4denvar results on WCOSS2 (Cactus). PR #692 is now the head of
GSI The DOYR mnemonic was replaced by MNTH DAYS effective 20240131 18Z. The GOME reader in Issue #716 was opened to document addition of the required changes to
By adding extra debug compiler options (-init=snan,arrays), an apparent misuse of variables was found in read_nsstbufr.f90. To document this fix, a draft PR was created in my fork: TingLei-daprediction#2.
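As a rough illustration of why this kind of option helps (my understanding is that it pre-fills otherwise uninitialized floating-point scalars and arrays with signaling NaNs so that any later use stands out): the Python sketch below mimics the idea with quiet NaNs. It is not the read_nsstbufr.f90 bug, and `make_work_array`, `fill_partially`, and `first_unset_index` are hypothetical names.

```python
import math


def make_work_array(n):
    """Mimic NaN initialization: every element starts out as NaN, not zero."""
    return [math.nan] * n


def fill_partially(arr, nvalid):
    """Set only the first nvalid elements, leaving the rest untouched."""
    for i in range(nvalid):
        arr[i] = float(i)


def first_unset_index(arr):
    """Return the index of the first element still holding NaN, or None."""
    for i, x in enumerate(arr):
        if math.isnan(x):
            return i
    return None


work = make_work_array(5)
fill_partially(work, 3)

# Using the whole array, including the elements never set, poisons the result.
total = sum(work)
print(math.isnan(total), first_unset_index(work))
```

With zero initialization the same mistake would silently produce a plausible-looking number; with NaN initialization the result is visibly invalid and the first unset element can be located.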
A reduced version of global_4denvar was run with radiance obs removed. The obs setup in gsiparm.anl is as below:
GSI behaved similarly: loproc_contrl and loproc_updat show identical results, while the hiproc runs differ from the loproc ones and from each other. So the culprit does not seem to be specific to radiance observations.
Another "reduced" version of global_4denvar, in which only static B was used (namely, a 3DVar with FGAT), still showed the same behavior.
An interesting finding: when the global_4denvar test is run with a reduced setup as in the previous runs (only 2 global members were used) and with factqmin=factqmax=0 (namely, this constraint is turned off), the test does succeed (with only a "Failure of max-time in the regression test").
Another update: using GSI built in debug mode, global_4denvar failed on Hera for the same reason as on WCOSS2, though GSI (both update and contrl) hasn't been updated to the current head of EMC GSI.
A modification within one OpenMP directive appears to have addressed the reproducibility issue observed between loproc and hiproc runs in the reduced version of global_4denvar (using only 2 members and a maximum of 5 inner iteration steps).
An update on global_4denvar, with update and control both updated to the current head of the EMC GSI develop branch.
I suggest a separate PR for the
On WCOSS2, when GSI is built in debug mode, GSI becomes idle and the job is eventually killed for , like ,
the error message would show:
Line 311 in genstat_gps.f90 is:
The reason for GSI hanging at this point needs to be investigated.
Added on Mar. 15, 2024: another issue was found, namely that loproc_updat != hiproc_updat, loproc_contrl != hiproc_contrl, and hiproc_updat != hiproc_contrl; only loproc_contrl = loproc_updat.
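A minimal sketch of how such a pairwise check across the four runs could be tabulated. This is not part of the GSI regression scripts; `reproducibility_table` is a hypothetical helper, and the numbers are placeholders chosen only to reproduce the pattern described above, not values from any actual run.

```python
from itertools import combinations


def reproducibility_table(results):
    """Map each pair of run names to True when their values are identical."""
    return {
        (a, b): results[a] == results[b]
        for a, b in combinations(sorted(results), 2)
    }


# Placeholder values (assumed for the demo, NOT taken from GSI output).
results = {
    "loproc_contrl": 1.0,
    "loproc_updat": 1.0,
    "hiproc_contrl": 1.0000000001,
    "hiproc_updat": 1.0000000002,
}

table = reproducibility_table(results)
matching = [pair for pair, same in table.items() if same]
print(matching)
```

With these placeholder values, the only matching pair is loproc_contrl and loproc_updat, which is the situation reported here.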