GSI building issues with HDF5 1.14.0 #563

Closed
junwang-noaa opened this issue Apr 25, 2023 · 9 comments · Fixed by #624

@junwang-noaa

The library team is trying to update HDF5 from the current 1.10.6 to the new version 1.14.0, which contains the parallel netCDF bug fixes. However, the initial test of GSI built with HDF5 1.14.0 failed (please see comments from George V. in ufs-community/ufs-weather-model#1621). Could someone from the GSI group take a look at this?

@junwang-noaa
Author

@Hang-Lei-NOAA @AlexanderRichert-NOAA would you please provide the modulefiles for the HDF5 1.14.0-related libraries? Thanks.

@AlexanderRichert-NOAA
Contributor

AlexanderRichert-NOAA commented Apr 25, 2023

On Acorn, to use HDF5 1.12.2:
/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute-hdf5-1.12.2/install/modulefiles/Core
and to use HDF5 1.14.0:
/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute/install/modulefiles/Core

Add those to $MODULEPATH (module use ...) and load the stack-intel and stack-cray-mpich modules as needed.
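
A minimal sketch of those steps on Acorn (the path and module names are taken from this comment; the exact set of modules to load depends on what GSI needs):

# Point MODULEPATH at the spack-stack installation (HDF5 1.14.0 stack shown here)
module use /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.3.0/envs/unified-env-compute/install/modulefiles/Core

# Load the compiler and MPI meta-modules, then verify what is available
module load stack-intel
module load stack-cray-mpich
module list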

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Apr 25, 2023 via email

@dtkleist
Contributor

@arunchawla-NOAA -- I believe you have someone to assign to this issue, correct?

@arunchawla-NOAA

Yes. Let me get back on this

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jul 7, 2023

@natalie-perlin and I have made some progress on this. Starting from the branch RussTreadon-NOAA:intel2022, I updated the hpc-stack location and the hdf5/netcdf versions, then ran the regression tests, comparing against @RussTreadon-NOAA's branch as a baseline. All hdf5/1.14.0 tests completed, but some of the hdf5/1.10.6 tests stalled and/or ran into time limits (global_3dvar, global_4dvar, and global_4denvar). Also, several tests produced different analysis results (hwrf_nmm_d2 and d3, netcdf_fv3_regional, rrfs_3denvar_glbens, and rtma). I have not analyzed these in detail, but they are concerning because the results differ between the loproc and hiproc tests even with the same hdf5/1.14.0 executable.
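
For reference, a rough sketch of rerunning just the affected tests from a GSI build directory (this assumes the usual ctest-driven regression suite; exact test names and any required environment variables may differ by branch and platform):

# Rerun only the tests mentioned above and print output for any failures
cd build
ctest -R "global_3dvar|global_4dvar|global_4denvar" --output-on-failure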

I ran similar tests on Hera and @natalie-perlin ran them on Gaea. The Hera tests ran to completion (though I no longer have the test results; I will rerun them now that Hera is back up from maintenance), while Gaea crashed with hdf5/1.14.0 on the global_3dvar and global_4denvar tests.

Note that to run the tests with different modulefiles, I used a method described by @RussTreadon-NOAA to load the appropriate modulefiles at run time, modifying sub_jet as follows:

 myuser=$LOGNAME
 myhost=$(hostname)

+exp=${jobname}
+if [[ ${exp} == *"updat"* ]]; then
+   modulefiles=/mnt/lfs1/NAGAPE/epic/David.Huber/GSI/gsi_hdf5.14/modulefiles
+elif [[ ${exp} == *"contrl"* ]]; then
+   modulefiles=/mnt/lfs1/NAGAPE/epic/David.Huber/GSI/gsi_22/modulefiles
+fi
+
+
 DATA=${DATA:-$ptmp/tmp}

 mkdir -p $DATA
@@ -126,7 +135,7 @@ echo "" >>$cfile
 echo ". /apps/lmod/lmod/init/sh"                           >> $cfile
 echo "module purge"                                        >> $cfile
-echo "module use $gsisrc/modulefiles"                      >> $cfile
+echo "module use $modulefiles"                             >> $cfile
 echo "module load gsi_jet" >> $cfile
 echo "module list"                                         >> $cfile

@DavidHuber-NOAA
Collaborator

On Hera, all tests pass except global_4dvar, global_4denvar, and global_3dvar:

global_4dvar fails due to different siginc files between the loproc_updat and loproc_contrl runs, which I will investigate further.
global_4denvar and global_3dvar fail due to exceeding the maximum memory threshold, which is non-critical.

@DavidHuber-NOAA
Collaborator

On further investigation, the loproc_contrl and loproc_updat siginc files generated in the global_4dvar step are slightly different in size (39487168 vs. 39483763 bytes) and appear to contain different header information. However, when compared with nccmp, the data, metadata, and encoding are identical, so I believe this is a false positive.
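
For reference, a comparison along those lines can be done with something like the following (the -d/-m/-f/-s flags are standard nccmp options; the file names are placeholders for the updat/contrl siginc outputs):

# Compare data (-d) and metadata (-m); -f continues past the first difference,
# and -s reports explicitly if the files are identical
nccmp -d -m -f -s siginc.loproc_updat siginc.loproc_contrl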

@DavidHuber-NOAA
Collaborator

I found an issue in gsi-ncdiag where the HDF5 chunk size is set to 16 GB when opening a netCDF file in append mode, which causes maxmem failures. This is a problem with HDF5 1.14.0, but not with 1.10.6. A new version of gsi-ncdiag will need to be installed on all platforms under spack-stack to resolve this issue. See NOAA-EMC/GSI-ncdiag#7.
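
A rough sketch of how an updated gsi-ncdiag might eventually be pulled into a spack-stack environment once a fixed release is tagged (the version below is a placeholder, not an actual release; the real update would go through the spack-stack maintainers):

# From within an activated spack-stack environment
spack add gsi-ncdiag@<fixed-version>   # placeholder version
spack concretize -f
spack install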
