GFDL to main (2023-04-06) #1599

Merged

Collaborator

@marshallward marshallward commented Apr 6, 2023

This pull request contains several new features and bugfixes, alongside a very large number of commits related to standardization of dimensional analysis. Despite the size, these should have no impact on any experiments beyond adjustments to configuration parameter files.

Note that this patch introduces the new netCDF I/O layer to handle simple operations which do not require - or simply do not work with - the FMS I/O layer. Reviewers should carefully verify any experiments which depend on datasets, particularly anything which may be outside of our standard test suite.


Features

Bugfixes

Refactor

Build/Test

Admin

Contributors:

Hallberg-NOAA and others added 30 commits October 28, 2022 13:23
  Created the new module remapping_attic to hold older versions of remapping
code that are no longer used by MOM6.  The subroutines isPosSumErrSignificant,
remapByProjection, remapByDeltaZ and integrateReconOnInterval were moved to
remapping_attic, where they can be tested by calling remapping_attic_unit_tests.
The hard-coded old_algorithm logical module variable and the code it wraps were
also eliminated.  Also added a schematic description of the units of the real
variables in the various routines in MOM_remapping and corrected some spelling
errors.  All answers are bitwise identical.
  Moved interpolate_column and reintegrate_column (without changing anything)
from MOM_diag_vkernels.F90 to MOM_remapping.F90 and incorporated the tests
that had been in diag_vkernels_unit_tests into remapping_unit_tests.  The
entire MOM_diag_vkernels.F90 file was then removed.  All answers are bitwise
identical, although the module for two public routines was changed and a third
was eliminated.
  Remove missing_value arguments to interpolate_column and reintegrate_column,
instead using 0 for the values in vanished cells.  This change helps to address
github.com/mom-ocean/issues/769.  Also added comments schematically
describing some of the argument units.  Because 0 was already being used for the
missing value (except in unit tests), all solutions are bitwise identical.
  Added the new subroutine check_remapped_values with the duplicative error
checking code in remapping_core_h and remapping_core_w, both to reduce code
volume and promote code coverage, and to make the substance of these two
routines easier to follow.  All answers are bitwise identical.
- Adds `.gitlab/pipeline-ci-tool.sh` to enact most of the stages of the gitlab CI pipeline
  - enables interactive/command-line reproduction of the pipeline
  - `.gitlab/pipeline-ci-tool.sh` is documented in .gitlab/README.md with instructions
    on how to use at the command line and what each function is doing
- All commands formerly in .gitlab-ci.yml are now one-line invocations of `.gitlab/pipeline-ci-tool.sh`
  so .gitlab-ci.yml is now considerably smaller and easier to read with statements like
  `.gitlab/pipeline-ci-tool.sh mrs-compile debug_gnu` or `.gitlab/pipeline-ci-tool.sh check-params gnu`
- Previously, all results were compared against the stored regression answers. This meant that
  any error (e.g. layout) would show up as a fail of all types. We now use the regression answers to
  check the repro-symmetric mode and then compare everything else to repro-symmetric or other results
  as appropriate. This allows us to distinguish between types of errors. The GH actions are doing it
  this way, and we originally did this in the first forms of the pipeline, but in the last re-factor
  I lazily switched to using the regression answers for everything.
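
The comparison strategy described in this commit message can be sketched as follows. This is a hypothetical Python illustration, not the logic of `.gitlab/pipeline-ci-tool.sh`: the `classify` helper and the configuration names other than repro-symmetric are assumptions, and plain strings stand in for each run's answers.

```python
def classify(results, regression, baseline="repro-symmetric"):
    """Return a dict mapping each failing configuration to a description.

    results:    configuration name -> answer string (a hypothetical
                stand-in for a run's checksums or ocean.stats content).
    regression: the stored regression answer.
    """
    failures = {}
    # Only the baseline configuration is checked against stored answers.
    if results[baseline] != regression:
        failures[baseline] = "differs from stored regression answers"
    # Every other configuration is checked against the baseline run,
    # so a layout or restart error is reported on its own.
    for name, answer in results.items():
        if name == baseline:
            continue
        if answer != results[baseline]:
            failures[name] = "differs from " + baseline
    return failures
```

With this arrangement, a genuine answer change is reported once against the regression answers, while a layout-only error is reported only for the layout test instead of failing every comparison.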
- The subroutine categorize_axes cannot find the axes in ice restart files and gives warnings
WARNING from PE     0: categorize_axes: Failed to identify x- and y- axes in the axis list (xaxis_1, yaxis_1, Time) of a variable being read from INPUT/ice_model.res.nc
- This leads to incorrect initialization and a subsequent sat.vap.press.overflow crash when using
infra/FMS2
- Same experiment runs fine with infra/FMS1
- After the fix the infra/FMS1 and infra/FMS2 answers are bitwise
identical
Added a line initializing the string Cartesian to a blank string in categorize_axes, so that it is not uninitialized when it is used a few lines later.
  Set the interpolation weights inside of interpolate_column to explicitly be
the complement of one another, thereby saving an extra division at each point
and reducing the number of variables that need to be stored, in preparation for
the creation of a separate subroutine to find interface positions.  This commit
is mathematically equivalent to what was there before, and the extensive unit
testing of interpolate_column is still passing, but it changes the value of some
interpolated interface diagnostics at the level of roundoff (but not the MOM6
solutions themselves, as they do not depend on interpolate_column yet).
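
A minimal Python sketch of the idea in this commit message, assuming simple linear interpolation between two bounding values; it is not the MOM6 `interpolate_column` routine, and the function and argument names are hypothetical. The two forms are mathematically equivalent but may differ at the level of roundoff.

```python
def interp_two_divisions(z_top, z_bot, z, val_top, val_bot):
    """Linear interpolation with each weight computed by its own division."""
    dz = z_bot - z_top
    w_top = (z_bot - z) / dz
    w_bot = (z - z_top) / dz
    return w_top * val_top + w_bot * val_bot


def interp_complement(z_top, z_bot, z, val_top, val_bot):
    """Linear interpolation with one division; the other weight is 1 - w."""
    w_bot = (z - z_top) / (z_bot - z_top)
    w_top = 1.0 - w_bot  # explicit complement, no second division
    return w_top * val_top + w_bot * val_bot
```

Only one weight needs to be stored in the second form, which is the saving the commit message refers to.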
This patch introduces a new autoconf macro, AX_FC_CHECK_C_LIB, which
confirms if the Fortran compiler can be linked to the netCDF C library.
As with other netCDF tests, the nc-config tool is used if necessary (and
available).

This resolves some recent issues on platforms where netCDF and
netCDF-Fortran are installed in separate locations, with different
library directories (-L).

It also resolves some false assumptions in configure.ac which presumed
equivalent access by the configured C and Fortran compilers.
Previously, we would test if the C compiler could be linked to netCDF,
and then assume that the Fortran compiler shared the same relationship.
We now use the Fortran compiler for both C and Fortran tests.

This patch fixes many issues observed on MacOS systems, including some
persistent problems on the GitHub Actions MacOS tests.  For example, we
can now use the default GCC 12 compilers, rather than forcing a rollback
to GCC 11.
This patch fixes some issues with testing of C bindings in Fortran.
Specifically, some tests are using a C compiler which may be
unconfigured, causing unexpected errors.

The autoconf script now uses the Fortran compiler to test these
bindings, rather than using the C compiler to test for their existence.
A new macro (AX_FC_CHECK_BIND_C) was added to run these tests.

This achieves the actual goal (test of Fortran binding) on top of the
original goal (availability of C function), while ensuring that the actual
compiler of interest (FC) is used in the test.

Two C-based tests are still present in the script for testing the size
of jmp_buf and sigjmp_buf.  The C compiler is now configured with the
AX_MPI macro, and is only used to determine the size of these structs.
* Setup OBC segments for COBALT/OBGC tracers

    - These are updates required to setup OBC segments for OBGC tracers.
    - Since the COBALT package has more than 50 tracers, using the MOM6 table
      mechanism for setting up OBC segments is not feasible. Rather, this
      update delegates such setup to the mechanisms used in ocean_BGS tracers,
      leaving the MOM6 mechanism for native tracers intact.
    - Fixed issues caught by MOM6 githubCI

* Add capability to change obc segment update period

- COBALT tracers do not need as frequent segment bc updates and can
  use a larger update period to speed up the model.
  This commit introduces a new parameter DT_OBC_SEG_UPDATE_OBGC
  that can be adjusted for obc segment update period.
- This commit applies the change only to BGC tracers but can easily
  be changed to apply for all.

* Insert missing US%T_to_sec

- The unit conversion factor was missing causing a crash in a newer test.

* Updates from Andrew Ross

- Avoid low initial values in the tracer reservoirs

* Per Andrew Ross review

* corrected indentation per review

* Avoid using module vars per review request

- Reviewer asked to avoid using module variables with "save" attributes.
- This commit hides the module variables inside the existing OBC type.

* Coding style corrections per review

* Modification per review: do_not_log if .not.associated(CS%OBC)

Co-authored-by: Robert Hallberg <Robert.Hallberg@noaa.gov>
This PR adds options for how the ice viscosity is obtained: computed by the model, computed from the observed surface velocity, or set to a constant value (for debugging purposes). A new character parameter "ICE_VISCOSITY_COMPUTE" is introduced with three possible values: "MODEL", where the ice viscosity is computed by the model; "OBS", where the ice viscosity is computed at the preprocessing step and read from a file (its name is defined by the parameter "ICE_STIFFNESS_FILE"), using the variable whose name is defined by the "A_GLEN_VARNAME" parameter; and "CONSTANT", where a constant value defined by the parameter "A_GLEN" is used. These changes are in MOM_ice_shelf_dynamics.F90. Minor changes were made to MOM_ice_shelf_initialize.F90 to correct units and scales.
  Added calls to get_param to set 12 input variable names in files via runtime
parameters, including TIDEAMP_VARNAME, TEMP_COORD_VAR, SALT_COORD_VAR,
THICKNESS_IC_VAR, INTERFACE_IC_RESCALE, TEMP_IC_VAR, SALT_IC_VAR, BASIN_VAR,
TIDAL_DISSIPATION_VAR, ROUGHNESS_VARNAME, TIDEAMP_VARNAME and KH_BG_2D_VARNAME.
Also added two new runtime parameters, THICKNESS_IC_RESCALE and
INTERFACE_IC_RESCALE, to allow input thickness and interface height fields to be
rescaled.  A number of spelling errors in comments or output messages in the
files that were being modified as a part of this commit were also corrected,
including changes in the documentation that appears in MOM_parameter_doc files.  All answers are
bitwise identical, but there are new entries and minor changes in many
MOM_parameter_doc files.
  Added calls to get_param to set 4 more input variable names in files via
runtime, including U_IC_VAR, V_IC_VAR, OPEN_DY_CU_VAR and OPEN_DX_CV_VAR.  Also
added or amended comments describing internal variables to describe their units
more consistently in MOM_shared_initialization.  All answers are bitwise
identical, but there may be new entries in some MOM_parameter_doc files.
  Corrected a bug in converting depths read from an input file from units of cm
to m when the ER03 version of tidal mixing is used.  This commit will change
answers when INT_TIDE_DISSIPATION = True, USE_CVMix_TIDAL = True, and
TIDAL_ENERGY_TYPE = "ER03".  There are no such configurations in the
MOM6-examples pipeline tests, and it is not clear whether or where such a
configuration has ever been used.

  This bug was introduced into dev/gfdl on Nov. 19, 2018 as a part of PR mom-ocean#883 in
commit 967e470, which was supposed to
be a refactoring of this portion of the code without changing answers, but
introduced this bug.  This commit should restore solutions with impacted
configurations to what they would have been before that earlier commit.
This patch removes the `build_{grid,data}.py` scripts from .testing's
tc4, along with the setup of the Python infrastructure used in the
.testing Makefile and GitHub Actions CI.

The Python scripts have been replaced with equivalent Fortran programs
which generate identical netCDF output.

A new rule (`preproc`) has been added to the .testing top Makefile for
generating the model input files.

The netCDF compiler dependencies are configured with autoconf, currently
duplicating the macros in `ac/configure.ac`. (NOTE: It may be possible
to share these with a common macro in ac/m4.)

The configure script and Makefile are currently generated from
`configure.ac` and `autoreconf`.  In the future, we could simply
pre-generate `configure` and add it to the repository.

This patch was motivated by the inability to install recent
netCDF-Python packages on systems with older gcc compilers, including
our main production machine.  We could have possibly resolved this by
adding compiler configuration to pip, or perhaps reported the issue to
the netCDF-python project for them to resolve.  But the costs of relying
on all this Python infrastructure are starting to exceed the benefits,
and I would recommend we excise it from our test suite.
GitLab CI includes the internal testing suite (.testing) and included an
explicit setup of the Python environment (`make work/local-env`).  The
rule has since been removed, and the command now fails.

This patch removes those steps, since we no longer use Python in the
tests.

It also slightly reworks the reporting of test output.  Instead of
re-running `make test`, it uses the `make test.summary` rule to report
the final result.
  Added a new logical argument to interpolate_column to specify whether the
interpolated interface values outside of massless layers should be masked to
zero.  Also refactored the code in interpolate_column to separate out the
determination of the interface position from the interpolation and the masking
to facilitate the extension of this code to use higher order interpolation in
planned subsequent changes.  All answers are bitwise identical, although there
is a new mandatory argument for a public interface.
  Added ALE_remap_interface_vals and ALE_remap_vertex_vals to handle the
interpolation of variables that are at the interfaces atop tracer cells or above
the corners of the tracers cells from one grid to another.  Because these are
not yet used (but have been tested in calls that will be added with the next
commit) all answers are bitwise identical, but there are two new publicly
visible routines.
  Added REMAP_AUXILIARY_VARS to control whether to remap the accelerations that
are used in the predictor step of the split RK2 time stepping scheme.  Also
added the new routines remap_dyn_split_rk2_aux_vars, remap_OBC_fields and
remap_vertvisc_aux_vars to do the remapping, and code to call these routines
when REMAP_AUXILIARY_VARS is true. By default, REMAP_AUXILIARY_VARS is false,
and all answers are bitwise identical, but the entire MOM6-examples regression
suite has been run with this set to true, and they do appear to give physically
plausible answers in all cases, partially addressing the issue noted at
github.com//issues/203.  New entries are added to the
MOM_parameter_doc files, and there are three new publicly visible routines, but
by default answers do not change.
* Adds the option to set the diffusivity KHTH as horizontally varying
* Can be enabled via READ_KHTH = True, filename is provided by user via KHTH_FILE
* Will return error if user sets both READ_KHTH = True and KHTH > 0
* full file path is now set as INPUTDIR/KHTH_FILE, where both
  INPUTDIR and KHTH_FILE are runtime parameters
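
The parameter rules listed above can be summarized in a short Python sketch. It is illustrative only: `resolve_khth_file` is a hypothetical helper, a plain dictionary stands in for the runtime parameter file, and the defaults used when a parameter is absent are assumptions.

```python
def resolve_khth_file(params):
    """Return the path of the KHTH input file, or None if KHTH is not read.

    params is a dict standing in for the runtime parameters; the defaults
    below (READ_KHTH = False, KHTH = 0) are assumed for this sketch.
    """
    read_khth = params.get("READ_KHTH", False)
    khth = params.get("KHTH", 0.0)
    # Setting both a file-based and a constant diffusivity is an error.
    if read_khth and khth > 0.0:
        raise ValueError("READ_KHTH = True and KHTH > 0 cannot both be set")
    if not read_khth:
        return None
    # The full file path is INPUTDIR/KHTH_FILE.
    return params["INPUTDIR"].rstrip("/") + "/" + params["KHTH_FILE"]
```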
thickness diffusivity --> isopycnal height diffusivity
  Corrected the units written to the output files for 4 diagnostics (CAu_Stokes,
CAv_Stokes, area_shelf_h and sfc_mass_flux) and added missing units arguments to
the get_param calls for some (mostly unlogged) parameters.  The logged calls
where units are added include those for EKE_MAX, NDIFF_DRHO_TOL, NDIFF_X_TOL,
and IMPULSE_SOURCE_TIME, while some unnecessary carriage returns were removed in
the descriptions of some of these and closely related parameters.  Also added
units to the comment describing the AGlen argument to initialize_ice_AGlen.  All
answers are bitwise identical, but there can be minor changes in the metadata of
some files, and some MOM_parameter_doc and available_diags files might exhibit
minor changes.
  Added a missing scale factor in the DENSE_WATER_EAST_SPONGE_SALT get_param
call in dense_water_initialize_sponges, and added comments describing the local
variables (and their units) throughout the dense_water_initialization module.
The variable set by DENSE_WATER_SILL_HEIGHT was unused and it probably was
always intended to be DENSE_WATER_SILL_DEPTH, which it now is.  Units arguments
were also added to two of the unlogged get_param calls in this module.  Without
this change, this test case would not reproduce with dimensional rescaling due
to a scale factor that was omitted when salinity was being rescaled on May 3,
2022, which became a part of PR #122 to dev/gfdl, but answers should not change
when dimensional rescaling of salinities is not used.  All answers and output in
the MOM6-examples test suite are bitwise identical.
  Removed meaningless units arguments from 31 get_param calls for integer,
character, or logical parameters.  All answers are bitwise identical, but some
entries in the various parameter_doc files are changed.
@jiandewang
Collaborator

Now back to business (after almost 48 hours of travel). I ran one of the UFS coupled cases on wcoss2 (production machine) and found that the results differ from the current baseline. Differences start from "v Predictor accel [uv]_accel_bt"; see line 1705 in the "err" files. I put the run log, restart and output files on GAEA at /lustre/f2/dev/ncep/Jiande.Wang/MOM6-update/20230406/M6-20230406. Files with "BM" are from current baseline runs, while "PR" are from runs with this PR. Since this PR code works fine on at least two of our R&D machines, I suspect something related to precision on wcoss2.

@marshallward
Collaborator Author

@jiandewang I looked at this with @Hallberg-NOAA and we believe that a likely source is the mixed layer restratification.

The diffs in the barotropic solvers are in the halos of the data domain, rather than the compute domain, and are probably due to fixes in the checksum calculations, which had the incorrect halo widths.
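
To illustrate why the halo width matters for such diagnostics, here is a small Python sketch. It is not MOM6's checksum code: a plain sum stands in for the checksum, and `make_field` and `region_sum` are hypothetical helpers. Two fields that agree on the compute domain but differ in their halos give the same result when only the compute domain is summed, and different results once halo points are included.

```python
def make_field(n_compute, halo, interior, halo_value):
    """Build a square 2D field with `halo` halo points on every side."""
    n = n_compute + 2 * halo
    return [
        [
            interior if (halo <= i < n - halo and halo <= j < n - halo) else halo_value
            for i in range(n)
        ]
        for j in range(n)
    ]


def region_sum(field, halo, width):
    """Sum the compute domain plus `width` halo points on each side."""
    n = len(field)
    lo, hi = halo - width, n - halo + width
    return sum(field[j][i] for j in range(lo, hi) for i in range(lo, hi))


# Same compute-domain values, different halo contents.
run_a = make_field(4, 2, 1.0, 0.0)
run_b = make_field(4, 2, 1.0, 9.0)
```

Here `region_sum(run_a, 2, 0)` and `region_sum(run_b, 2, 0)` agree, while the sums with `width=1` do not, even though the compute-domain values are identical.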

The first compute-domain divergence is in Post-mixedlayer_restrat h. We can't say a whole lot without looking at our end, but we found a possible source of divergence:

Old:

h_vel = 0.5*((htot_fast(i,j) + htot_fast(i+1,j)) + h_neglect) * GV%H_to_Z
mom_mixrate = vonKar_x_pi2*u_star**2 / &
(absf*h_vel**2 + 4.0*(h_vel+dz_neglect)*u_star)
timescale = 0.0625 * (absf + 2.0*mom_mixrate) / (absf**2 + mom_mixrate**2)
timescale = timescale * CS%ml_restrat_coef

New:

h_vel = 0.5*((htot_fast(i,j) + htot_fast(i+1,j)) + h_neglect) * GV%H_to_Z
timescale = growth_time(u_star, h_vel, absf, dz_neglect, CS%vonKar, CS%Kv_restrat, CS%ml_restrat_coef)

! This case reproduces the previous answers, but the extra h_neg is otherwise unnecessary.
mom_mixrate = (pi2*vonKar)*u_star**2 / (absf*hBL**2 + 4.0*(hBL + h_neg)*u_star)
growth_time = restrat_coef * (0.0625 * (absf + 2.0*mom_mixrate) / (absf**2 + mom_mixrate**2))

It's not clear why this could cause a regression, since it's only a refactor, but it's possible that an aggressive compiler could have reordered the multiplication, or moving into a separate function suppressed some undesired optimization.
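
The reordering hypothesis can be illustrated with a short Python sketch. This is not a diagnosis, since the thread never establishes the cause; it only shows that regrouping one product in the growth-time expression can change the last bits of the result. The values of `VON_KAR` and `PI2` and the sampled input ranges are assumptions made for illustration.

```python
import math
import random

VON_KAR = 0.41       # assumed value, for illustration only
PI2 = math.pi ** 2   # assumes pi2 denotes pi squared


def growth_time(u_star, h, absf, h_neg, coef, regroup=False):
    """Growth-time expression from the snippet above, with two groupings."""
    u2 = u_star ** 2
    if regroup:
        numer = PI2 * (VON_KAR * u2)   # a reassociated product
    else:
        numer = (PI2 * VON_KAR) * u2   # grouping as written in the source
    mom_mixrate = numer / (absf * h**2 + 4.0 * (h + h_neg) * u_star)
    return coef * (0.0625 * (absf + 2.0 * mom_mixrate) / (absf**2 + mom_mixrate**2))


def count_bitwise_differences(n_samples=2000, seed=0):
    """Count samples where the two groupings differ, and the largest relative gap."""
    rng = random.Random(seed)
    n_diff, max_rel = 0, 0.0
    for _ in range(n_samples):
        u_star = rng.uniform(1e-3, 5e-2)
        h = rng.uniform(5.0, 200.0)
        absf = rng.uniform(1e-5, 1.4e-4)
        a = growth_time(u_star, h, absf, 1e-30, 1.0)
        b = growth_time(u_star, h, absf, 1e-30, 1.0, regroup=True)
        if a != b:
            n_diff += 1
            max_rel = max(max_rel, abs(a - b) / abs(a))
    return n_diff, max_rel
```

Any differences found this way are at roundoff level, but in a chaotic model a roundoff-level change is enough to fail a bitwise regression test.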

There is also enough going on in this file that some kind of regression could have happened.

If I get you a modified version of this file without the code change, would you be able to test it? (Unfortunately it's not as simple as reverting the file, since there was an API change.)

If you have the compiler version and flags used on WCOSS-2, that might also help troubleshoot the problem on our end.

@jiandewang
Collaborator

@marshallward and @Hallberg-NOAA thanks for the quick response. Yes, I will be able to test if you have a modified file for me. I suspected this commit was the cause but hit code conflicts when trying to revert it. Note that we have several different MOM settings in UFS, and all of them had this issue except our super coarse resolution (5x5 ocean); in that setting, the mixed layer is turned off. I will provide the compile flags shortly.

@jiandewang
Collaborator

@marshallward I put compiling output file at /lustre/f2/dev/ncep/Jiande.Wang/MOM6-update/20230406/M6-20230406, see "compiling-output" line 798 for an example of flags being used

@marshallward
Collaborator Author

marshallward commented Apr 17, 2023

Compiler was Intel Fortran 19.1.3, flags are `-g -traceback -i4 -r8 -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -O2 -debug minimal -fp-model source -module mod`.
This is very close to what we use, so I don't think we can attribute it to the compiler itself.

@gustavo-marques
Collaborator

We are also seeing a change in answers for the CESM tests.

@marshallward
Collaborator Author

@gustavo-marques Do you know if the diff is in MOM_mixed_layer_restrat?

Also it seems like WCOSS-2 is AMD, is your machine AMD or Intel?

@jiandewang
Collaborator

@marshallward I am 99% sure the modified code you provided to me solved the problem. Now I got identical results on wcoss2, I am repeating testing on R&D machines but there is a long waiting job queue at this moment, will update once jobs are done.

@marshallward
Collaborator Author

marshallward commented Apr 20, 2023

@jiandewang Thanks, that is good news although I think we need to understand what might have caused the problem. I wonder if it is related to the AMD chipset.

@gustavo-marques I can provide the same modified mixed layer restratification file if you are ready to test it.

@gustavo-marques
Collaborator

> @gustavo-marques Do you know if the diff is in MOM_mixed_layer_restrat?
>
> Also it seems like WCOSS-2 is AMD, is your machine AMD or Intel?

@marshallward, I have not looked close enough to find out if the difference is coming from MOM_mixed_layer_restrat.
Cheyenne uses Intel.
I am happy to test the modified mixed layer restratification file.
Thanks!

@marshallward
Collaborator Author

I tested this on C5 (our AMD cluster) but did not detect any regressions. So it's not a simple matter of Intel vs AMD architecture. We are using a different compiler from WCOSS (Intel 2022 vs 2019) but unfortunately 2022.02 is all we have available on C5. We could get into specifics (C5 is EPYC 7H12) but I doubt that is the reason.

It was suggested Monday that there are differences in some layer-remapped diagnostics due to time-space average swapping, but it seems the consensus was that these answer changes are at the lowest bits and are acceptable. (Anyone please correct me if that is wrong.)

Unfortunately I don't think we have sufficient access to explain this regression, but for now we will revert MOM_mixed_layer_restrat.F90 and revisit this in the future.

Due to some machines reporting a regression in the mixed layer
restratification code, this patch reverts the calculation of the growth
time in a separate function.

Most of the content related to comments and parameter setup has been
retained, even if those parameters are no longer used.
Collaborator

@jiandewang jiandewang left a comment

With the reversion of the MOM_mixed_layer_restrat growth_time, it passed the UFS regression tests on wcoss2 and two other R&D machines.

@marshallward
Collaborator Author

The reversion of MOM_mixed_layer_restrat.F90 was able to preserve the introduction of u_star_min, but has deferred the introduction of the growth_time function.

@marshallward
Collaborator Author

@gustavo-marques @sanAkel Any update on this PR?

@sanAkel
Collaborator

sanAkel commented May 3, 2023

> @gustavo-marques @sanAkel Any update on this PR?

@marshallward I plan to test this today/tomorrow.

Collaborator

@sanAkel sanAkel left a comment

Answers remain same, tests passed!

@marshallward thanks for your patience.

sanAkel pushed a commit to GEOS-ESM/GEOS_OceanGridComp that referenced this pull request May 4, 2023
@gustavo-marques
Collaborator

Sorry for the delay. Our tests are passing (no changes in the ocean.stats).
I approve this PR.

@gustavo-marques gustavo-marques self-requested a review May 4, 2023 18:47
@marshallward marshallward merged commit 400bd21 into mom-ocean:main May 4, 2023
@marshallward marshallward deleted the dev-gfdl-main-candidate-2023-04-06 branch May 4, 2023 18:57
@marshallward
Collaborator Author

Thanks to all for testing and working through the revisions.
