Complete issue #607 (Optimization of multiscale regional runs) #670
Conversation
…FV3 DA (FV3LAMDA)
An update: the failure in the hafs regression test when the GSI is compiled in debug mode was caused by the undefined variable toff in read_radar_l2rw_novadqc. The issue exists in the current EMC GSI and already has a fix in another PR: https://github.com/NOAA-EMC/GSI/pull/679/files#diff-9237af61f8a4787143a6ef94bfdefcd7a764f8f986e54370beffb92b3f08d2d3 |
An update: when the EnKF is also compiled in debug mode, the current EMC GSI (and hence this PR as well) fails the global_enkf regression test. There are two problem spots in letkf.f90. The first is easy to diagnose and fix.
(and the similar lines for tmean): tmean is only valid on rank 0 after mpi_reduce, but it is used on every MPI rank in the following line. The second issue is trickier. The error message is:
Line 281 is the first line of a large block of OpenMP directives:
I haven't identified the culprit for this error. |
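For reference, a minimal sketch of the mpi_reduce pattern and one possible fix (a toy program with hypothetical variable names, not the actual letkf.f90 code; it assumes only the standard MPI Fortran module):

    program reduce_sketch
      ! Sketch of the first issue described above: after mpi_reduce the
      ! result buffer is defined only on the root rank, so using it on
      ! every rank reads an undefined value (caught in debug builds).
      use mpi
      implicit none
      integer :: ierr, nproc, myrank
      real(8) :: tsum, tmean

      call mpi_init(ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)
      call mpi_comm_rank(mpi_comm_world, myrank, ierr)

      tsum = real(myrank + 1, 8)

      ! Problematic pattern (kept as comments):
      !   call mpi_reduce(tsum, tmean, 1, mpi_real8, mpi_sum, 0, mpi_comm_world, ierr)
      !   tmean = tmean / real(nproc, 8)   ! tmean is undefined on ranks /= 0

      ! One possible fix: make the result valid on all ranks.
      call mpi_allreduce(tsum, tmean, 1, mpi_real8, mpi_sum, mpi_comm_world, ierr)
      tmean = tmean / real(nproc, 8)       ! now well defined everywhere

      if (myrank == 0) print *, 'mean = ', tmean
      call mpi_finalize(ierr)
    end program reduce_sketch

An alternative that keeps mpi_reduce would be to restrict every later use of tmean to rank 0 or to broadcast the result after the reduction.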
Thank you for all the modifications! I have approved it now.
I did the Orion tests as follows.
Non-reproducibility in global_4denvar, hafs_4denvar_glbens, and hafs_3denvar_hybens is as we expected. There are no fatal issues. |
I ran some more full resolution tests on WCOSS2 at b7bdbc4, this time for the early dump. Here are the timings:
All PR tests are faster than all control tests. My findings for the cost function are similar to my previous full resolution tests. Cost functions are identical within each test (i.e., all pmot=0 cases are the same), and the pmot=0 and pmot=2 cases have identical cost functions. The control and PR initial cost functions agree only to 7 digits, which is a larger difference than in the regression tests:
|
Sho, thanks for the approval. |
Cathy, thanks for running this experiment! Glad you got positive results! But I'm a bit confused about why you got different initial penalties. Which observation types produced a different initial penalty (e.g., radiances, wind, etc.)? Also, can you send me the diff of the output diagnostic files (fort.2**) for that observation type for the first outer iteration? Thanks! |
The due date for completion of this PR is Thursday, 1/25/2024. @shoyokota has approved the PR. @TingLei-NOAA , what outstanding questions or issues do you need answered or resolved before you can complete your review? |
@RussTreadon-NOAA As reported previously, I investigated a few GSI/EnKF regression test failures that occur when GSI/EnKF are built in debug mode. I concluded that they come from the EMC GSI itself (some already have fixes in other PRs; one in the EnKF has a simple fix, and another in the LETKF needs more effort) and shouldn't affect this PR. What I am going to do is check the RRFS comparison with this PR and then do a quick review of the current code. That is my plan, but I still can't guarantee I can make it before your due date. |
Thank you, @TingLei-NOAA , for the update. By what date do you anticipate completing your review? Knowing this helps the GSI Handling Review team plan their work and helps @jderber-NOAA arrange his schedule. |
@RussTreadon-NOAA Thanks for all your efforts. I just hope you also noticed the possible reproducibility issue in the global case that @CatherineThomas-NOAA reported. On my side, I expect I can finish all my actions in one week if everything goes well. |
@TingLei-NOAA , as mentioned last Friday (1/19/2024), would you prefer to be removed as a reviewer from this PR? This would allow you to continue your tests without needing to work outside your normal workday schedule. You have a lot on your plate and we don't want you burning the candle at both ends to complete this PR by the due date. |
@RussTreadon-NOAA Sure. Please go ahead. I will give updates on my verification results for this PR. |
@jderber-NOAA
The initial difference is in the radiances:
Control:
J term                      J
excess moisture             3.8458219840846913E-01
surface pressure            1.1982398576591007E+04
temperature                 8.8339292197175426E+04
wind                        1.9503807419532203E+05
moisture                    8.2165047114701410E+03
sst                         3.3361437026688859E+03
ozone                       1.8475954283744115E+04
gps bending angle           2.5125135570621365E+05
radiance                    1.0160292865790830E+06
-----------------------------------------------------
J Global                    1.5926693945344666E+06
PR:
J term                      J
excess moisture             3.8458219840846913E-01
surface pressure            1.1982398576591007E+04
temperature                 8.8339292197175426E+04
wind                        1.9503807419532203E+05
moisture                    8.2165047114701410E+03
sst                         3.3361437026688859E+03
ozone                       1.8475954283744115E+04
gps bending angle           2.5125135570621365E+05
radiance                    1.0160293280502578E+06
-----------------------------------------------------
J Global                    1.5926694360056415E+06
My original tests have been scrubbed, but I'll rerun a single case of each today and copy over the fort files. |
But... but I didn't change anything with the radiances. Sigh. |
Cathy, I will run my full resolution tests again. Could you send your script (or one of your outputs) to me (or put it on Hera) so that I can make sure I am using the same satellite data types? John |
An update: the current PR produced identical results, in terms of the stats in the stdout, to the EMC GSI develop branch in an RRFS parallel case using observations including satellite and dbz data. |
@jderber-NOAA @TingLei-NOAA
I got to the root of the initial cost function differences. @RussTreadon-NOAA ran several cases on WCOSS2 as well and found identical cost functions. I also found another cycle that had identical cost functions with my same setup. We explored some of the differences between our configurations, but the problem was with my executables: my clone of develop contained the Dec 20 update to setuprad.f90, while my clone of John's branch did not. Russ checked out a newer version of John's branch for his tests, which contained John's merge from develop.
My new tests using the latest version of the PR branch now produce identical initial cost functions. |
Thanks Cathy for looking into this and finding the issue! Sorry if I did not keep my branch up to date with develop and made you waste your time. It is a great relief that there was not a real issue. Best, John |
Approve. Merge pending final peer review.
Thank you @shoyokota and @CatherineThomas-NOAA for your approvals. I'll coordinate with the Handling Review team to merge this PR into develop. |
Thank you Russ too! |
Description
This issue was intended to speed up regional multiscale runs. The initial parts of these changes were intended to speed up the initial processing of the observations. Unfortunately, I was only able to complete changes to the reading of the observations.
The changes include the addition of deallocates for many of the allocates in the reading routines, a restructuring of the thinning process, and the addition of some new options for the input pmot values.
The restructuring of the thinning process sets flags for the data that are thinned and flags for the data that fail QC in the read routine. Then, based on the input pmot value, the data are compressed to retain only the desired data. The pmot values are:
pmot = 0. Pass on all observations that fail the setup quality control, but not those that are thinned.
pmot = 1. Pass on all observations that fail the setup quality control and the obs that are thinned.
pmot = 2. Pass on only the observations that do not fail the quality control and are not thinned.
pmot = 3. Pass on only the observations that do not fail the quality control plus the obs that are thinned.
The pmot = 0 and 1 options are the same as in the read_satwnd routine before, but the change has been included for all routines that thin the input data using the convthin routines. The intent was to allow an option to minimize the data that are passed on to the setup routines if desired. This is similar to what is done for the global AVN runs as compared to the FNL runs, except that it is applicable to individual observation types.
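As an illustration of the flag-and-compress logic described above, here is a hedged sketch (hypothetical names and interface, not the actual GSI read-routine code) of how the pmot value could decide which observations survive the compression step:

    logical function keep_ob(pmot, failed_qc, thinned)
      ! Sketch only: decide whether an observation is retained after the
      ! read routine, given the requested pmot option (hypothetical names).
      implicit none
      real,    intent(in) :: pmot        ! option read from the convinfo file
      logical, intent(in) :: failed_qc   ! flag set by the read-routine QC
      logical, intent(in) :: thinned     ! flag set by convthin/convthin_time
      select case (nint(pmot))
      case (0)      ! keep obs that fail setup QC, but drop thinned obs
         keep_ob = .not. thinned
      case (1)      ! keep obs that fail setup QC and obs that are thinned
         keep_ob = .true.
      case (2)      ! keep only obs that pass QC and are not thinned
         keep_ob = (.not. failed_qc) .and. (.not. thinned)
      case (3)      ! keep obs that pass QC plus the thinned obs
         keep_ob = (.not. failed_qc) .or. thinned
      case default  ! fall back to keeping everything
         keep_ob = .true.
      end select
    end function keep_ob

The retained observations would then be compressed in place (for example with a counter loop or the PACK intrinsic) before being passed to the setup routines.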
The routines to which the above changes were made are:
The values for pmot are defined as real numbers in the convinfo file (except where noted above).
In addition, some other minor changes were made:
Please see issue #607 for more details.
Fixes #607
DUE DATE for merger of this PR into develop is 1/25/2024 (six weeks after PR creation).
Type of change
How Has This Been Tested?
These changes have been tested within the context of the global runs to ensure that they reproduce the current results.
Also, the regression ctests were run. When running the ctests it was noted that identical results were not found (minor round-off level differences). After extensive debugging it was found that some of the reading routines (read_radar, read_fl_hdob, and read_radar_wind_ascii.f90) were not properly implemented to ensure that the observation order was retained. In addition, errors were found in the convthin and convthin_time routines; these errors probably did not impact the results. All of the necessary changes are on Hera in /scratch1/NCEPDEV/da/John.Derber/converge4/fixdev.
Note: after the control and new code produced ctest results that passed all tests, the new code was modified to slightly change the order of the observations. The change reorders the observations so that, when convinfo entries are available for the data, the observations without a convinfo entry come first, followed by the observations ordered according to their convinfo entries. Within each of these groups, the original reading order is retained. This reordering results in some round-off changes in the regression tests, as sketched below.
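As an illustration only, a sketch of the reordering just described, assuming a hypothetical array icvindex that holds each observation's convinfo-entry index (0 when the observation has no convinfo entry); the resulting permutation puts the no-entry observations first, then groups the rest by convinfo entry, preserving the reading order within each group:

    subroutine reorder_by_convinfo(nobs, ncvmax, icvindex, iorder)
      ! Sketch only (hypothetical names, not the actual GSI code).
      ! icvindex(i) = convinfo-entry index of the i-th observation as read,
      !               or 0 if the observation has no convinfo entry.
      ! iorder      = resulting permutation of 1..nobs.
      implicit none
      integer, intent(in)  :: nobs, ncvmax
      integer, intent(in)  :: icvindex(nobs)
      integer, intent(out) :: iorder(nobs)
      integer :: i, icv, k
      k = 0
      ! First, observations with no convinfo entry, in reading order.
      do i = 1, nobs
         if (icvindex(i) == 0) then
            k = k + 1
            iorder(k) = i
         end if
      end do
      ! Then observations grouped by convinfo entry, reading order kept
      ! within each group (a counting sort would be faster; this form
      ! keeps the stable, grouped ordering obvious).
      do icv = 1, ncvmax
         do i = 1, nobs
            if (icvindex(i) == icv) then
               k = k + 1
               iorder(k) = i
            end if
         end do
      end do
    end subroutine reorder_by_convinfo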
A final run of the ctests on Hera resulted in three of the regression tests not passing, but for those three regression tests only round-off differences were found.
ctest results:
1/7 Test #3: rrfs_3denvar_glbens .............. Passed 668.02 sec
2/7 Test #4: netcdf_fv3_regional .............. Passed 967.17 sec
3/7 Test #7: global_enkf ...................... Passed 996.21 sec
4/7 Test #6: hafs_3denvar_hybens ..............***Failed 1288.89 sec
5/7 Test #5: hafs_4denvar_glbens ..............***Failed 1465.38 sec
6/7 Test #2: rtma ............................. Passed 1630.76 sec
7/7 Test #1: global_4denvar ...................***Failed 1746.12 sec
57% tests passed, 3 tests failed out of 7
For the failed regression tests, the differences are very small.
For global_4denvar the difference over the first 5 iterations is:
New:
< Initial gradient norm = 1.700751272515123674E+03
< cost,grad,step,b,step? = 1 0 6.584168406980355503E+05 1.700751272515123674E+03 1.057429134927263981E+00 0.000000000000000000E+00 good
< cost,grad,step,b,step? = 1 1 6.547802752899467014E+05 2.113894264916129487E+03 1.927405067994359289E+00 1.302727127739899959E+00 good
< cost,grad,step,b,step? = 1 2 6.461427496709105326E+05 1.349310521414032110E+03 2.811042534667704373E+00 1.212753185604616535E+00 good
< cost,grad,step,b,step? = 1 3 6.308338990252234507E+05 1.492308135350379871E+03 3.826902315539039368E+00 4.828500088351426189E-01 good
< cost,grad,step,b,step? = 1 4 6.206079687548581278E+05 1.732872322582532661E+03 2.853763847749684945E+00 1.567453853347285486E+00 good
< cost,grad,step,b,step? = 1 5 6.086367270158508327E+05 2.052318345404916272E+03 2.200825030950578132E+00 1.123654808458361520E+00 good
Control:
The penalties are identical for the first 3 iterations and the rest of the differences are only in the last few digits.
hafs_3denvar_hybens:
Control:
< cost,grad,step,b,step? = 1 0 1.484488764136420796E+05 5.220586731636382865E+03 3.266670310728393423E-01 0.000000000000000000E+00 good
< cost,grad,step,b,step? = 1 1 1.395457213798956072E+05 3.987278541138146920E+03 1.020272669622300121E+00 5.833302794603654196E-01 good
< cost,grad,step,b,step? = 1 2 1.214543780598063022E+05 2.282966557825236123E+03 1.175497927720148050E+00 3.579527393674070690E-01 good
< cost,grad,step,b,step? = 1 3 1.148424922169799393E+05 2.138015369099894997E+03 9.466815484382745671E-01 8.870869467349256077E-01 good
< cost,grad,step,b,step? = 1 4 1.101613679136846185E+05 3.525710978798607357E+03 7.850637926717303205E-01 2.702293955513371593E+00 good
< cost,grad,step,b,step? = 1 5 1.004694635052513186E+05 1.314894254315365288E+03 4.282579905121543185E+00 1.420689505641469230E-01 good
New:
The penalties are the same in iterations 1, 2, and 5. Other differences are small.
hafs_4denvar_glbens:
Control:
New:
< cost,grad,step,b,step? = 1 0 1.098322171332156431E+05 3.099058770311941316E+03 8.407846553134072920E-01 0.000000000000000000E+00 good
All values (grad, step, b) are the same in iteration 0. The penalties are the same in iterations 0, 1, 2, and 3. Other differences are small.
Since the code was tested with the same order of observations and produced no differences, and the only remaining differences are in the final digits, I conclude that the differences are only round-off due to the different ordering of observations from the read_ routines.