
Building with the spack-stack unified environment on non-production machines #589

Closed
DavidHuber-NOAA opened this issue Jul 13, 2023 · 61 comments · Fixed by #624

Comments

@DavidHuber-NOAA
Collaborator

With the upgrade to HDF5 version 1.14.0 ongoing (#563), this would be a good time to move to the spack-stack unified environment, which also incorporates this version of HDF5 and netCDF version 4.9.2. Some additions are needed to the unified environment (in particular, ncio, ncdiag, and bufr), but these can be added manually as they already exist in spack. This is also desired for the upcoming UFS SRW release, which will incorporate the GSI.

DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/GSI that referenced this issue Jul 13, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Jul 13, 2023
@DavidHuber-NOAA
Collaborator Author

I have established a testing environment on Hera referencing a test build of the spack-stack unified environment. So far, this is working well. All regression tests run to completion. I am running the regression tests against the branch to be merged in #571. There are some differences between the two, with the hwrf, global_4denvar, global_3denvar, global_3dvar, and rtma tests all failing cost and/or penalty comparison tests, with values differing in the 1-5% range on the first iteration.

One potential source of the differences is the newer version of the BUFR library. The spack-stack test install uses version 12.0.0, whereas the hpc-stack-based install is compiled against version 11.7.0. Running spack-stack with the older version would determine whether this is the source of the differences.

@DavidHuber-NOAA DavidHuber-NOAA changed the title Building with the spack-stack unified environment Building with the spack-stack unified environment on non-production machines Jul 14, 2023
@DavidHuber-NOAA
Collaborator Author

@arunchawla-NOAA @RussTreadon-NOAA @hu5970 @ulmononian @mark-a-potts
I ran regression tests on Hera with spack-stack version 1.4.1 + ncdiag/1.1.1 and bufr/11.7.0. The large differences that I was seeing with bufr/12.0.0 are now on the order of 10^4 smaller, so it seems that the new version of bufr produces quite different results. Is it known whether this difference is expected? And if so, would it be desirable to upgrade to bufr/12.0.0 at this time, or should I continue incorporating spack-stack with bufr/11.7.0?

@RussTreadon-NOAA
Contributor

Dave ran ctests on Hera using develop built with bufr/11.7.0 and develop built with bufr/12.0.0. He reported large differences in the minimization.

Output from the loproc global_4denvar for updat (bufr/12.0.0) and contrl (bufr/11.7.0) was examined. Initial differences are limited to prepbufr temperature observations.

The initial temperature (obs-ges) innovation statistics (fort.203) differ for prepbufr report types 120, 132, 180, and 182. These report types are virtual temperatures. The bias and rms differences are larger for observations closer to the surface and decrease moving up in the atmosphere. Innovation statistics for report types 130, 131, 133, 134, and 136 are identical. These report types are sensible temperatures.

This suggests something odd happens when processing virtual temperature observations with bufr/12.0.0.

Prints added to src/gsi/read_prepbufr.f90 reveal that when gsi.x is built with bufr/12.0.0 the following call to ufbqcd

! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

returns vtcd=0.395253-322 for sonde temperature obs. (I used g13.6 formatting and 0.395253-322 is printed to stdout)

Recompiling the code with bufr/11.7.0 and running the same case returns vtcd=8.00000 for sonde temperature obs.

Do we need to change or modify the call to ufbqcd or is a different code change needed when moving to bufr/12.0.0?

@RussTreadon-NOAA
Contributor

Writing vtcd without formatting, the bufr/12.0.0 run has 3.952525166729972E-323.

Replace

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

with

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',ivtcd)
  vtcd = ivtcd

in src/gsi/read_prepbufr.f90. Recompile gsi.x with bufr/12.0.0 and rerun the global_3dvar ctest. This time vtcd has the expected value of 8.00000000000000.

The global_3dvar ctest failed

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R global_3dvar
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
1/1 Test #1: global_3dvar .....................***Failed  2039.84 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 2039.92 sec

The following tests FAILED:
          1 - global_3dvar (Failed)

This failure, however, was due to the maximum allowable threshold time check. The updat bufr/12.0.0 gsi.x ran slower than the contrl bufr/11.7.0 gsi.x.

global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 321.045029
global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 358.756143
global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 474.135176
global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 512.837189

This may reflect the additional prints in the bufr/12.0.0 read_prepbufr.f90.

Comparison of the analysis results shows that the bufr/12.0.0 run reproduces the bufr/11.7.0 run. This result is reasonable. Changing the bufr library version should not change analysis results.

Not sure why replacing a real(8) argument with an integer(4) argument was needed in this call to ufbqcd. Similar changes were not made elsewhere. Thus, not confident the above change is the best change.

My working directories on Hera are

  • updat with bufr/12.0.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12
  • contrl with bufr/11.7.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop

global_3dvar ctest results are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/bufr12/tmpreg_global_3dvar

Other ctests have not yet been run with the above change.

@RussTreadon-NOAA
Contributor

Hera ctests with bufr/12.0.0

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  974.84 sec
2/9 Test #5: hwrf_nmm_d3 ......................***Failed  992.72 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1037.49 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  1039.35 sec
5/9 Test #9: global_enkf ......................***Failed  1202.61 sec
6/9 Test #6: rtma .............................***Failed  2002.66 sec
7/9 Test #3: global_4denvar ...................***Failed  2224.99 sec
8/9 Test #2: global_4dvar .....................   Passed  2230.50 sec
9/9 Test #1: global_3dvar .....................***Failed  2392.88 sec

33% tests passed, 6 tests failed out of 9

Total Test time (real) = 2393.03 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          3 - global_4denvar (Failed)
          5 - hwrf_nmm_d3 (Failed)
          6 - rtma (Failed)
          8 - netcdf_fv3_regional (Failed)
          9 - global_enkf (Failed)

All failures except rtma were due to timing checks. Apart from rtma, the bufr/12.0.0 analyses are identical to their bufr/11.7.0 counterparts. For several failed tests, the bufr/12.0.0 executable ran slower than the bufr/11.7.0 executable for one or more of the configurations. For example, global_3dvar

tmpreg_global_3dvar/global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 326.543222
tmpreg_global_3dvar/global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 359.565145
tmpreg_global_3dvar/global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 467.854069
tmpreg_global_3dvar/global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 509.382080

and global_4denvar

tmpreg_global_4denvar/global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 373.331819
tmpreg_global_4denvar/global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 371.220896
tmpreg_global_4denvar/global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 385.547573
tmpreg_global_4denvar/global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 434.971107

The rtma failure shows different initial innovation statistics for temperature report types 197 and 199. There is another call to ufbqcd in read_prepbufr.f90

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',glcd)

As a test, replace glcd with iglcd in the calling list and set glcd to iglcd after the call

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',iglcd)
     glcd = iglcd

Recompile and rerun rtma test. This time the rtma test passed.

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R rtma
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 6: rtma
1/1 Test #6: rtma .............................   Passed  1278.39 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1278.40 sec

Stop here and let others follow up / investigate more.

@DavidHuber-NOAA
Collaborator Author

Also tagging @jbathegit @edwardhartnett @wx20jjung @dtkleist for awareness.

@arunchawla-NOAA

@DavidHuber-NOAA and @RussTreadon-NOAA thanks for your tests on this! For now I would focus on building and testing with the spack-stack that has bufr/11.7.0. I am going to bring this to the attention of @jbathegit, @jack-woollen, and @edwardhartnett to see if they can throw more light on Russ's results.

@jbathegit

jbathegit commented Jul 21, 2023

It looks like you figured the issue out on your own. This was indeed a planned change within v12.0.0, as noted in the release notes...

  • Subroutines ufbqcd() and ufbqcp() have been modified to pass call arguments containing event program codes as integers, rather than continuing to pass them as real numbers. [Issue #78]

This was done intentionally as part of the simplification down to one library build. Note that in previous versions of ufbqcd, the 3rd argument returned was just a real, not an explicit real*8. So it was only a real*8 in the _8 and _d builds because of the compile flags for those builds, and not because of some intrinsic need within the library itself. The return value of the 3rd argument is never larger than 255 so it fits easily inside of any real, and in fact it's always a whole number so there's really no reason not to return it as an integer and make everything simpler.
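As a hedged illustration of this calling-convention change, the sketch below shows a hypothetical wrapper (not the actual read_prepbufr.f90 code) that receives the program code as an integer and then copies it into the real used downstream:

 ! Hypothetical wrapper illustrating the bufr/12.0.0 ufbqcd interface
 subroutine get_virtmp_code(lunin, vtcd)
   implicit none
   integer, intent(in)  :: lunin    ! BUFR logical unit, already opened via openbf
   real(8), intent(out) :: vtcd     ! downstream comparisons still use a real
   integer :: ivtcd

   call ufbqcd(lunin, 'VIRTMP', ivtcd)   ! v12: third argument is now an integer
   vtcd = ivtcd
 end subroutine get_virtmp_code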

@jack-woollen
Contributor

I feel the need to contribute my two cents in support of backward compatibility with downstream codes as an important principle in software management. Weighing the reasons for and against a change like this, it seems to me there is not a good enough compensating reason to make it.

@jbathegit

jbathegit commented Jul 26, 2023

I hear what you're saying Jack, and I totally agree that backwards compatibility is an important principle in software management. But in this case, this is no different than where we've already agreed that _8 users will now need to make a change to their application codes to add a call to setim8b, and to make sure that all calls to library functions which return integers are now explicitly declared as integer*4.
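For reference, a rough sketch of those documented _8-user adjustments (a hypothetical application; the file name and unit are placeholders, and the exact declarations will depend on the code):

 ! Hypothetical _8 application sketch: declare once that integer arguments are
 ! 8-byte, and declare integer-returning library functions as integer*4.
 program eight_byte_app
   implicit none
   integer(4)       :: ireadmg            ! library functions now return integer*4
   integer(8)       :: lunit, idate       ! application integers are 8-byte
   character(len=8) :: subset

   call setim8b(.true.)                   ! tell the single v12 build to expect 8-byte integers

   lunit = 11_8
   open(unit=11, file='obs.bufr', form='unformatted')   ! hypothetical input file
   call openbf(lunit, 'IN', lunit)
   do while (ireadmg(lunit, subset, idate) == 0)
      ! ... process each message ...
   end do
   call closbf(lunit)
 end program eight_byte_app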

In other words, with a major release such as this (where the X of the version X.Y.Z is changing), it's understood that downstream users may need to make some adjustments to their application codes, and this ufbqcd change is just another example of that, and we've already documented this in the accompanying release notes for _8 and _d users.

Note also that, if we hadn't included this ufbqcd change, then as software managers we'd have had to instead develop new internal routines similar to x48 and x84, but for reals instead of integers, since this routine (along with its sister routine ufbqcp) were the only cases of user-callable routines where a real argument was being passed that wasn't already explicitly declared as real*8. And we couldn't have just changed the library to now make that 3rd argument an explicit real*8, because doing so would have similarly impacted any existing _4 users. So bottom line, no matter what we did here, it would have impacted some of our users, and this is just one of the consequences of our earlier decision to eliminate the previous _8 and _d builds and consolidate everything down to a single _4 build.

@jbathegit

On a separate but related note, I did note the discussion where some of the spack-stack test cases appear to be running slower with the new 12.0.0 build than with the previous 11.7.0 build. That's definitely not a feature that we intentionally introduced into the new build, nor is it something that we saw in our own regression testing for the library, so if it's still an issue of concern then we'd definitely like to better understand what part(s) of the library may be responsible. Is there any way that some sort of profiling tool might better pinpoint where those differences are (e.g. in which library functions or lines of code) for your test cases?

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Sep 14, 2023

I ran regression tests with spack-stack/1.4.1 on Hera against the current develop branch with the following results:

1/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1293.51 sec
2/9 Test #4: hwrf_nmm_d2 ......................***Failed  1592.51 sec
3/9 Test #5: hwrf_nmm_d3 ......................   Passed  1601.61 sec
4/9 Test #9: global_enkf ......................   Passed  1616.37 sec
5/9 Test #8: netcdf_fv3_regional ..............***Failed  1651.17 sec
6/9 Test #6: rtma .............................   Passed  2195.24 sec
7/9 Test #3: global_4denvar ...................***Failed  2551.00 sec
8/9 Test #2: global_4dvar .....................***Failed  2608.58 sec
9/9 Test #1: global_3dvar .....................***Failed  2730.67 sec

Looking into the netcdf_fv3_regional case, differences first appear in initial radiance penalties. This seems like a CRTM issue, though I am having trouble tracking it down.

The CRTM directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl

The CRTM fix directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0_emc-zvsufvh

The contents of the fix directories are identical. The hpc-stack source code for CRTM is located here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/crtm-v2.4.0, which is identical to the version cloned by spack-stack (github.com/JCSDA/crtm version 2.4.0; hash 5ddd0d6).

So there doesn't seem to be an obvious difference in CRTM. The library versions do differ between the two stacks. In particular, spack-stack uses hdf5/1.14.0 and netcdf/4.9.2, its nemsio is still at 2.5.2, and it includes w3nco, but I don't think these should have an impact.

hpc-stack

  1) miniconda3/4.12.0    9) netcdf/4.7.4  17) nemsio/2.5.4
  2) hpc/1.2.0           10) bufr/11.7.0   18) wrf_io/1.2.0
  3) intel/2022.1.2      11) bacio/2.4.1   19) ncio/1.1.2
  4) hpc-intel/2022.1.2  12) w3emc/2.9.2   20) crtm/2.4.0
  5) impi/2022.1.2       13) sp/2.3.3      21) ncdiag/1.1.1
  6) hpc-impi/2022.1.2   14) ip/3.3.3      22) gsi_common
  7) cmake/3.20.1        15) sigio/2.3.2   23) prod_util/1.2.2
  8) hdf5/1.10.6         16) sfcio/1.4.1   24) gsi_hera.intel

spack-stack

  1) intel/2022.1.2                   12) zstd/1.5.2            23) nemsio/2.5.2
  2) stack-intel/2021.5.0             13) netcdf-c/4.9.2        24) wrf-io/1.2.0
  3) impi/2022.1.2                    14) netcdf-fortran/4.6.0  25) ncio/1.1.2
  4) stack-intel-oneapi-mpi/2021.5.1  15) bufr/11.7.0           26) crtm-fix/2.4.0_emc
  5) miniconda/3.9.12                 16) bacio/2.4.1           27) git-lfs/2.10.0
  6) stack-python/3.9.12              17) w3emc/2.9.2           28) crtm/2.4.0
  7) cmake/3.23.1                     18) sp/2.3.3              29) gsi-ncdiag/1.1.1
  8) curl/7.29.0                      19) ip/3.3.3              30) gsi_common
  9) pkg-config/0.27.1                20) sigio/2.3.2           31) prod-util/1.2.2
 10) zlib/1.2.13                      21) sfcio/1.4.1           32) gsi_hera.intel
 11) hdf5/1.14.0                      22) w3nco/2.4.1

@RussTreadon-NOAA Would you mind taking a look at the RTs on Hera? My regression test directory is in /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression and the working directories are in /scratch1/NCEPDEV/stmp/David.Huber/ptmp_spack.

@RussTreadon-NOAA
Contributor

A check of the run directories shows the crtm coefficients to be identical between the updat and contrl runs.

A check of fort.207 shows initial radiance penalty differences between the updat and contrl for the following satellite_sensors: amsua_n15, amsua_n18, amsua_n19, atms_npp, atms_n20. These are all microwave sensors. However, other microwave satellite_sensors have identical initial penalties between the updat and contrl (e.g., mhs_n19, amsua_metop-b, amsua_metop-c, mhs_metop-c).

Are the same compiler flags used to build the hpc-stack and spack-stack crtm libraries?

@DavidHuber-NOAA
Collaborator Author

@AlexanderRichert-NOAA Could you point me to the build log for spack-stack/1.4.1 CRTM/2.4.0 on Hera? I'm seeing some differences between it and the hpc-stack version.

@AlexanderRichert-NOAA
Contributor

Sure-- /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl/.spack/spack-build-out.txt (and there's other build-related info in that directory)

@AlexanderRichert-NOAA
Contributor

I won't look at it any further unless you'd like me to, but for what it's worth the difference that stands out to me is the CMAKE_BUILD_TYPE (hpc-stack=Release, spack-stack=RelWithDebInfo), and we've seen some differences in other codes based on that setting.

@jack-woollen
Contributor

@jbathegit @edwardhartnett I just ran a simple test on wc2 reading a large satellite file, which is basically all the gsi does with bufrlib, and got much the same result David did in his tests. With bl11 it took 2m5s, with bl12 2m45s. Timing changes a little from run to run, but it generally takes something like 40s longer with bl12. My test literally just read the data using ireadmg and ireadsb then stopped. I'll proceed to comment out the im8b test blocks for the routines involved in the read process and see what happens. The fix may just be something like separate paths through the lower-level routines for the 4- and 8-byte cases. Depends on how many routines that turns out to be.
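For reference, a minimal read-only driver of the kind described might look like the sketch below (hypothetical file name and unit number; this is not the actual test code):

 ! Minimal read-only BUFR timing driver: walk every message and subset, then stop.
 program read_timing
   implicit none
   integer, parameter :: lunit = 11
   character(len=8)   :: subset
   integer :: idate, nmsg, nsub
   integer :: ireadmg, ireadsb          ! NCEPLIBS-bufr functions

   open(lunit, file='mtiasi.bufr', form='unformatted')   ! hypothetical file name
   call openbf(lunit, 'IN', lunit)

   nmsg = 0
   nsub = 0
   do while (ireadmg(lunit, subset, idate) == 0)
      nmsg = nmsg + 1
      do while (ireadsb(lunit) == 0)
         nsub = nsub + 1                ! decode only; no ufbint/ufbrep calls
      end do
   end do

   call closbf(lunit)
   print *, 'messages read:', nmsg, '  subsets read:', nsub
 end program read_timing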

@jack-woollen
Contributor

jack-woollen commented Oct 29, 2023

Moving the tests to hera, using gfortran, v12 was actually a little bit faster than v11. So how do you compile the bufrlib with ifort on hera?

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Hmm, interesting. I compiled version 12 of bufrlib two ways -- CMAKE_BUILD_TYPE="Release" (case 2 here) and CMAKE_BUILD_TYPE="RelWithDebInfo" (case 1). Both are slower than version 11.7.0 with CMAKE_BUILD_TYPE="Release" (case ctl).

@edwardhartnett

Can this problem be demonstrated in a bufrlib test program?

That is, can we get a simple, one-file program, which runs slower under v12 than v11?

@DavidHuber-NOAA
Collaborator Author

@edwardhartnett It sounded like Jack had done just that in this comment.

@jack-woollen
Contributor

Timing these codes can be fiddly, depending on what platform they are tested on, and when. WCOSS2 has demonstrated widely varying timings via the unix time command. At first the results looked like v12 was slower than v11. That's when I wrote the comment above. But when I ran it over and over and over, the results became more murky. Then on Sunday I tried timing on hera, where v12 was a couple seconds faster than v11, but compiling with gfortran. Then I went back to WC2 (and ifort) and got results like hera, v12 faster by a hair.

So far my simple timing test only checks reading a large, compressed (mtiasi) file. I'm thinking running the GSI observer with a full set of data will give a better comparison for that code. Working on it now. The elephant in the room for bufrlib timing may be the prepobs process, since it does reading, writing, and arithmetic. I'll look at that also.

My question about compiling bufrlib on hera with ifort refers to the fact that when I download bufrlib from git, the cmake default is to use gfortran. I'm wondering what to change so it compiles with ifort. Any help with that is appreciated.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Oct 30, 2023

@jack-woollen Ah, understood. To compile with Intel on Hera, you only need to load the intel/2022.1.2 module and then run cmake. CMake will detect the Intel compiler and generate the makefiles accordingly.
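For reference, one possible command sequence (an assumption on my part, not a verified recipe) that makes the compiler choice explicit:

 module load intel/2022.1.2
 cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_BUILD_TYPE=Release /path/to/NCEPLIBS-bufr
 make -j4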

@jack-woollen
Contributor

jack-woollen commented Nov 1, 2023

Thanks to @DavidHuber-NOAA's help with the gsi cmake setup, I have been able to make timing comparisons of running the gsi observer mode with BF-11.7.0 and BF-12.0.0. The observer reads all the BUFR datasets into the program and then stops if the variable miter=0. So it exercises most if not all of the bufr coding within the gsi, and it only takes a few minutes to complete. Some experimentation has revealed what looks like roughly half of the timing issue apparent in the gsi runs using bufrlib version 12. Subroutine upb8.f, which is called when unpacking compressed bufr datasets, benefited from some optimization specifically for the 4-byte case. This subroutine was not introduced as part of the mods for the single library build, but rather to accommodate the new WMO-approved feature of allowing numeric elements to fill up to 64 bits, instead of just 32 as it has been until now. The fix for upb8.f can be reviewed on dogwood in directory /lfs/h2/emc/global/noscrub/Jack.Woollen/bufrtime/bufr_v12.0.0/NCEPLIBS-bufr/src.
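To make the idea concrete, here is a self-contained sketch of the kind of nbits<=32 fast path being described (illustrative only; it is not the actual upb8.f source):

 ! Illustrative fast-path sketch: extract the low nbits bits of a packed value
 ! with 32-bit arithmetic in the common case, 64-bit only for wide elements.
 program fastpath_demo
   implicit none
   integer(8) :: packed, value
   integer    :: nbits

   packed = 48879_8          ! hypothetical packed value (hex BEEF)
   nbits  = 12

   if (nbits <= 32) then
      ! common 4-byte case: 32-bit extraction, then widen to 8 bytes
      value = int(ibits(int(packed, 4), 0, nbits), 8)
   else
      ! rare wide-element case allowed in BUFR v12: 64-bit extraction
      value = ibits(packed, 0, nbits)
   end if

   print '(a,i0,a,i0)', 'low ', nbits, ' bits = ', value   ! prints 3823
 end program fastpath_demo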

Following is a list of timing results before and after the mods to upb8.

Before upb8 change

The total amount of time in user mode = 268.026405 ~40s
The total amount of time in user mode = 308.623302

The total amount of time in user mode = 267.477490 ~42s
The total amount of time in user mode = 309.731031

The total amount of time in user mode = 267.551686 ~43s
The total amount of time in user mode = 310.306406

The total amount of time in user mode = 266.931843 ~50s
The total amount of time in user mode = 316.532808

The total amount of time in user mode = 268.897811 ~47s
The total amount of time in user mode = 315.577455

After upb8 change

The total amount of time in user mode = 268.612845 ~22s
The total amount of time in user mode = 290.555594

The total amount of time in user mode = 274.917766 ~14s
The total amount of time in user mode = 288.483719

The total amount of time in user mode = 265.291353 ~25s
The total amount of time in user mode = 290.776556

The total amount of time in user mode = 265.125857 ~25s
The total amount of time in user mode = 290.373777

The total amount of time in user mode = 267.488541 ~23s
The total amount of time in user mode = 290.313897

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Fantastic! Thanks for the update. I will run this optimization through the GSI regression tests.

@jbathegit

@jack-woollen I also have to add my kudos - well done sir! Your change to add an nbits<=32 block to optimize performance for the 4-byte majority of cases makes perfect sense, and it looks like David's preliminary tests with your update are having the intended effect as well. Thanks again!

I'll go ahead and set up a PR to add this to the library baseline for the next release, and I'll let you know when it's ready.

@jbathegit

BTW, and just for my own FYI, how were you doing your timings? Were you using calls to the intrinsic routines cpu_time and system_clock inside of the Fortran source code, or were you using something like the UNIX time command inside of a shell script? As you know, I've started working to add some timing tests to the library CI suite, so I'd like to replicate it using whatever best practices are being used by others for such testing in GitHub or elsewhere.

Or maybe this is more of a question for @DavidHuber-NOAA or others?
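For what it's worth, a minimal sketch of the two intrinsics mentioned above (generic Fortran; not taken from the GSI or the library CI suite):

 ! cpu_time gives CPU seconds; system_clock gives wall-clock ticks.
 program timing_demo
   implicit none
   real       :: t0, t1, x
   integer(8) :: c0, c1, rate
   integer    :: i

   call cpu_time(t0)
   call system_clock(c0, rate)

   x = 0.0
   do i = 1, 10000000            ! stand-in for the work being timed
      x = x + sin(real(i))
   end do

   call cpu_time(t1)
   call system_clock(c1)

   print *, 'CPU time  (s):', t1 - t0
   print *, 'Wall time (s):', real(c1 - c0) / real(rate)
   print *, x                    ! keep the loop from being optimized away
 end program timing_demo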

@jack-woollen
Contributor

@jbathegit For the gsi there is the timing printed from the w3 routine, I think. For other tests I use the unix time command. In either case the user time is more stable than the wall time for comparisons.

@jbathegit

Thanks Jack.

Do you happen to know which w3 routine they're using? Maybe w3utcdat?

And by "user time", I presume you mean CPU time?

@jack-woollen
Contributor

Jeff, I don't know what timer is used in gsi. It isn't w3utcdat, at least I can't find that in the code. Whatever it is, it reports like this:

RESOURCE STATISTICS**************
The total amount of wall time = 265.619982
The total amount of time in user mode = 292.463298
The total amount of time in sys mode = 5.330310
The maximum resident set size (KB) = 1827508
Number of page faults without I/O activity = 171892
Number of page faults with I/O activity = 24
Number of times filesystem performed INPUT = 1938274
Number of times filesystem performed OUTPUT = 121176
Number of Voluntary Context Switches = 6155
Number of InVoluntary Context Switches = 417
END OF RESOURCE STATISTICS********

Of course the unix timer reports:

real 0m0.012s
user 0m0.007s
sys 0m0.005s

In either case I think the user time is more stable than wall time. It is probably more like cpu time.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Nov 8, 2023

I ran regression tests against spack-stack/1.5.1 on Orion. Multiple tests failed their timethresh checks. Following the suggestion in NOAA-EMC/global-workflow#2044, I tried adding --cpus-per-task to the srun call, but this did not help. Additionally, the contrl tests run significantly faster (in some cases >2x faster) than the updat. For example, the timings for hafs_3denvar_hybens were

hafs_3denvar_hybens_loproc_contrl   05:55
hafs_3denvar_hybens_loproc_updat   12:31

The global_enkf and hafs_4denvar_glbens tests showed similar slowdowns.

I then ran regression tests on Hera using spack-stack/1.5.1. No tests failed for timethresh exceedances. It seems that the slowdowns are limited to Orion and only to spack-stack/1.5.1.

However, 4 tests failed for non-reproducible cost function results on both systems (netcdf_fv3_regional, hafs_3denvar_hybens, hafs_4denvar_glbens, and global_4denvar). These will need to be investigated.

@jack-woollen
Contributor

jack-woollen commented Nov 9, 2023

@DavidHuber-NOAA
Something obviously not right here. Have you replicated the last round of regression tests on hera? Just curious.

FYI, I copied the latest souped-up test version of bufr_v12 over to hera and compiled it with intel/2022.1.2, which is consistent with the bufr/11.7.0 module, and ran the time test reading through an 827MB mtiasi file. The results are below. They are a little better than I see on wcoss2. The updated v12 bufr working set is found on hera in /scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0.

/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-2022.1.2/bufr/11.7.0/lib64/libbufr_4.a
real 2m0.061s
user 1m59.297s
sys 0m0.275s

/scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0/build/path1/lib64/libbufr_4.a
real 2m1.337s
user 2m0.621s
sys 0m0.310s

@DavidHuber-NOAA
Collaborator Author

@jack-woollen @AlexanderRichert-NOAA Those are some very promising times. I will give that a try in the regression tests on Hera later today.

And yes, I tried these tests out on Hera as well with spack-stack/1.5.1. There isn't a long runtime issue on that system, but there are reproducibility issues. I looked into it this morning and it appears that CRTM is still being built with CMAKE_BUILD_TYPE=RelWithDebInfo (/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-tbbqmsa/.spack/spack-build-01-cmake-out.txt). Alex, I thought that the default build type was supposed to change to Release; is that not the case?

@DavidHuber-NOAA
Collaborator Author

I recompiled both develop and spack-stack on Orion and reran the regression tests this morning. The time results were much improved, though somewhat erratic. Perhaps I forgot to recompile after updating the modulefiles yesterday, or perhaps there is a system issue. That said, four tests still failed their timethresh checks: hafs_3denvar_hybens, hafs_4denvar_glbens, global_enkf, and rrfs_3denvar_glbens. The HAFS tests seem to be the most affected. Below are all of the timings on Orion:

test hi/lo proc branch time
global_4denvar hiproc develop 283.738921
global_4denvar hiproc spack-stack 291.758573
global_4denvar loproc develop 391.136941
global_4denvar loproc spack-stack 400.481576
global_enkf hiproc develop 65.608062
global_enkf hiproc spack-stack 74.296133
global_enkf loproc develop 108.938672
global_enkf loproc spack-stack 91.795480
hafs_3denvar_hybens hiproc develop 259.064089
hafs_3denvar_hybens hiproc spack-stack 255.735360
hafs_3denvar_hybens loproc develop 315.284807
hafs_3denvar_hybens loproc spack-stack 564.090202
hafs_4denvar_glbens hiproc develop 273.772581
hafs_4denvar_glbens hiproc spack-stack 291.219525
hafs_4denvar_glbens loproc develop 404.680337
hafs_4denvar_glbens loproc spack-stack 539.146473
netcdf_fv3_regional hiproc develop 69.890257
netcdf_fv3_regional hiproc spack-stack 64.494140
netcdf_fv3_regional loproc develop 73.620493
netcdf_fv3_regional loproc spack-stack 79.163477
rrfs_3denvar_glbens hiproc develop 78.840823
rrfs_3denvar_glbens hiproc spack-stack 122.124053
rrfs_3denvar_glbens loproc develop 87.458248
rrfs_3denvar_glbens loproc spack-stack 110.392840
rtma hiproc develop 223.007984
rtma hiproc spack-stack 236.367820
rtma loproc develop 237.599617
rtma loproc spack-stack 213.418943

For comparison, below are the same runtimes on Hera. The runtimes are obviously slower for spack-stack on Hera (I'm guessing due to optimization settings), but not to the same extent as Orion.

test hi/lo proc branch time
global_4denvar hiproc develop 290.166800
global_4denvar hiproc spack-stack 308.381291
global_4denvar loproc develop 359.552020
global_4denvar loproc spack-stack 378.906640
global_enkf hiproc develop 60.475895
global_enkf hiproc spack-stack 62.403168
global_enkf loproc develop 82.992120
global_enkf loproc spack-stack 91.201193
hafs_3denvar_hybens hiproc develop 203.225758
hafs_3denvar_hybens hiproc spack-stack 225.581660
hafs_3denvar_hybens loproc develop 289.802607
hafs_3denvar_hybens loproc spack-stack 308.264200
hafs_4denvar_glbens hiproc develop 246.582969
hafs_4denvar_glbens hiproc spack-stack 258.118368
hafs_4denvar_glbens loproc develop 341.492453
hafs_4denvar_glbens loproc spack-stack 351.850318
netcdf_fv3_regional hiproc develop 55.377637
netcdf_fv3_regional hiproc spack-stack 58.965612
netcdf_fv3_regional loproc develop 66.615093
netcdf_fv3_regional loproc spack-stack 72.121535
rrfs_3denvar_glbens hiproc develop 64.051000
rrfs_3denvar_glbens hiproc spack-stack 64.975927
rrfs_3denvar_glbens loproc develop 80.114589
rrfs_3denvar_glbens loproc spack-stack 115.086221
rtma hiproc develop 188.083975
rtma hiproc spack-stack 190.591114
rtma loproc develop 199.716769
rtma loproc spack-stack 198.905565

I'm going to double check that I have all of the right libraries and then try rerunning the tests on Orion on /work2 just to see if that makes a difference.

@jack-woollen
Contributor

Here's a summary of the latest results as shown above. The orion side has some non-bufr issues for sure. Given the latest optimization of the bufr code, and if I understand correctly that these tests are only variations of running the gsi executable, where the bufr data is read just once and done, then any difference between develop and spack-stack of more than a very few seconds is due to something other than the bufrlib. That might include i/o or cpu bottlenecks or other hardware or software factors which could be in play.

test hi/lo proc branch hera time hera dev-stk orion dev-stk orion time
global_4denvar hiproc develop 290.17     283.74
global_4denvar hiproc spack-stack 308.38 18.21 8.02 291.76
global_4denvar loproc develop 359.55     391.14
global_4denvar loproc spack-stack 378.91 19.35 9.34 400.48
global_enkf hiproc develop 60.48     65.61
global_enkf hiproc spack-stack 62.40 1.93 8.69 74.30
global_enkf loproc develop 82.99     108.94
global_enkf loproc spack-stack 91.20 8.21 -17.14 91.80
hafs_3denvar_hybens hiproc develop 203.23     259.06
hafs_3denvar_hybens hiproc spack-stack 225.58 22.36 -3.33 255.74
hafs_3denvar_hybens loproc develop 289.80     315.28
hafs_3denvar_hybens loproc spack-stack 308.26 18.46 248.81 564.09
hafs_4denvar_glbens hiproc develop 246.58     273.77
hafs_4denvar_glbens hiproc spack-stack 258.12 11.54 17.45 291.22
hafs_4denvar_glbens loproc develop 341.49     404.68
hafs_4denvar_glbens loproc spack-stack 351.85 10.36 134.47 539.15
netcdf_fv3_regional hiproc develop 55.38     69.89
netcdf_fv3_regional hiproc spack-stack 58.97 3.59 -5.40 64.49
netcdf_fv3_regional loproc develop 66.62     73.62
netcdf_fv3_regional loproc spack-stack 72.12 5.51 5.54 79.16
rrfs_3denvar_glbens hiproc develop 64.05     78.84
rrfs_3denvar_glbens hiproc spack-stack 64.98 0.92 43.28 122.12
rrfs_3denvar_glbens loproc develop 80.11     87.46
rrfs_3denvar_glbens loproc spack-stack 115.09 34.97 22.93 110.39
rtma hiproc develop 188.08     223.01
rtma hiproc spack-stack 190.59 2.51 13.36 236.37
rtma loproc develop 199.72     237.60
rtma loproc spack-stack 198.91 -0.81 -24.18 213.42

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Agreed that these differences are not BUFR-related (I'm trying to keep the BUFR-related tests in #642). I believe the differences on Hera are due to optimization differences of the libraries between develop (hpc-stack) and spack-stack. Looking into this some more, the only GSI-used library compiled with RelWithDebInfo appears to be CRTM, so the time differences are likely related to that library alone. This is being tracked in JCSDA/spack-stack#827.

The differences on Orion are something more, and I think I/O may be a significant factor. The /work filesystem has always been slow, and perhaps the effect of spack-stack's lower optimization is magnified when additional disk I/O is involved. Indeed, I just finished a test on /work2 and the very large time differences in the HAFS tests go away. There is still some difference, but it is much smaller:

test hi/lo proc branch time
global_4denvar hiproc develop 275.825299
global_4denvar hiproc spack-stack 314.997323
global_4denvar loproc develop 371.205687
global_4denvar loproc spack-stack 395.405433
global_enkf hiproc develop 65.735713
global_enkf hiproc spack-stack 68.736429
global_enkf loproc develop 83.469664
global_enkf loproc spack-stack 86.675028
hafs_3denvar_hybens hiproc develop 254.080952
hafs_3denvar_hybens hiproc spack-stack 277.932448
hafs_3denvar_hybens loproc develop 336.877198
hafs_3denvar_hybens loproc spack-stack 377.288445
hafs_4denvar_glbens hiproc develop 297.089561
hafs_4denvar_glbens hiproc spack-stack 310.065488
hafs_4denvar_glbens loproc develop 398.269650
hafs_4denvar_glbens loproc spack-stack 429.657761
netcdf_fv3_regional hiproc develop 61.574793
netcdf_fv3_regional hiproc spack-stack 74.771153
netcdf_fv3_regional loproc develop 72.966334
netcdf_fv3_regional loproc spack-stack 87.934112
rrfs_3denvar_glbens hiproc develop 97.207300
rrfs_3denvar_glbens hiproc spack-stack 102.607774
rrfs_3denvar_glbens loproc develop 93.075213
rrfs_3denvar_glbens loproc spack-stack 125.331071
rtma hiproc develop 206.374645
rtma hiproc spack-stack 206.658501
rtma loproc develop 216.188268
rtma loproc spack-stack 216.373627

I think I am satisfied with these timings on Orion. I'm going to move on to Hercules, Jet, and S4.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 9, 2023
@DavidHuber-NOAA
Collaborator Author

I built this branch on S4. I won't be able to test on that system until the global workflow is updated, since the regression test dataset contains restricted data, which is not allowed on S4.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 13, 2023
This includes netcdf/4.9.2 and hdf5/1.14.0.  However, it does not include
the upgrades in spack-stack on the other systems for ncdiag/1.1.2,
ip/4.3.0, and w3emc/2.10.0.
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023