
Building with the spack-stack unified environment on non-production machines #589

Closed
DavidHuber-NOAA opened this issue Jul 13, 2023 · 61 comments · Fixed by #624

Comments

@DavidHuber-NOAA
Collaborator

With the upgrade to HDF5 version 1.14.0 ongoing (#563), this would be a good time to move to the spack-stack unified environment, which also incorporates this version of HDF5 and netCDF version 4.9.2. Some additions are needed to the unified environment (in particular, ncio, ncdiag, and bufr), but these can be added manually as they already exist in spack. This is also desired for the upcoming UFS SRW release, which will incorporate the GSI.

DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/GSI that referenced this issue Jul 13, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Jul 13, 2023
@DavidHuber-NOAA
Collaborator Author

I have established a testing environment on Hera referencing a test build of the spack-stack unified environment. So far, this is working well. All regression tests run to completion. I am running the regression tests against the branch to be merged in #571. There are some differences between the two, with the hwrf, global_4denvar, global_3denvar, global_3dvar, and rtma tests all failing cost and/or penalty comparison tests, with values differing in the 1-5% range on the first iteration.

One potential source of the differences is the newer version of the BUFR library. The spack-stack test install uses version 12.0.0, whereas the hpc-stack-based install is compiled against version 11.7.0. Running spack-stack with the older version would determine whether this is the source of the differences.

@DavidHuber-NOAA DavidHuber-NOAA changed the title Building with the spack-stack unified environment Building with the spack-stack unified environment on non-production machines Jul 14, 2023
@DavidHuber-NOAA
Collaborator Author

@arunchawla-NOAA @RussTreadon-NOAA @hu5970 @ulmononian @mark-a-potts
I ran regression tests on Hera with spack-stack version 1.4.1 + ncdiag/1.1.1 and bufr/11.7.0. The large differences that I was seeing with bufr/12.0.0 are now on the order of 10^4 smaller, so it seems that the new version of bufr produces quite different results. Is it known whether this difference is expected? And if so, would it be desirable to upgrade to bufr/12.0.0 at this time, or should I continue incorporating spack-stack with bufr/11.7.0?

@RussTreadon-NOAA
Contributor

Dave ran ctests on Hera using develop built with bufr/11.7.0 and develop built with bufr/12.0.0. He reported large differences in the minimization.

Output from the loproc global_4denvar for updat (bufr/12.0.0) and contrl (bufr/11.7.0) was examined. Initial differences are limited to prepbufr temperature observations.

The initial temperature (obs-ges) innovation statistics (fort.203) differ for prepbufr report types 120, 132, 180, and 182. These report types are virtual temperatures. The bias and rms differences are larger for observations closer to the surface and decrease moving up in the atmosphere. Innovation statistics for report types 130, 131, 133, 134, and 136 are identical. These report types are sensible temperatures.

This suggests something odd happens when processing virtual temperature observations with bufr/12.0.0.

Prints added to src/gsi/read_prepbufr.f90 reveal that when gsi.x is built with bufr/12.0.0 the following call to ufbqcd

! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

returns vtcd=0.395253-322 for sonde temperature obs. (I used g13.6 formatting and 0.395253-322 is printed to stdout)

Recompiling the code with bufr/11.7.0 and running the same case returns vtcd=8.00000 for sonde temperature obs.

Do we need to change or modify the call to ufbqcd or is a different code change needed when moving to bufr/12.0.0?

@RussTreadon-NOAA
Contributor

Writing vtcd without formatting, the bufr/12.0.0 run has 3.952525166729972E-323.

Replace

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',vtcd)

with

 ! Obtain program code (VTCD) associated with "VIRTMP" step
  call ufbqcd(lunin,'VIRTMP',ivtcd)
  vtcd = ivtcd

in src/gsi/read_prepbufr.f90. Recompile gsi.x with bufr/12.0.0 and rerun the global_3dvar ctest. This time vtcd has the expected value of 8.00000000000000.

The global_3dvar ctest failed

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R global_3dvar
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
1/1 Test #1: global_3dvar .....................***Failed  2039.84 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 2039.92 sec

The following tests FAILED:
          1 - global_3dvar (Failed)

This failure, however, was due to the maximum allowable threshold time check. The updat bufr/12.0.0 gsi.x ran slower than the contrl bufr/11.7.0 gsi.x.

global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 321.045029
global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 358.756143
global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 474.135176
global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 512.837189

This may reflect the additional prints in the bufr/12.0.0 read_prepbufr.f90.

Comparison of the analysis results shows that the bufr/12.0.0 run reproduces the bufr/11.7.0 run. This result is reasonable. Changing the bufr library version should not change analysis results.

Not sure why replacing a real(8) argument with an integer(4) argument was needed in this call to ufbqcd. Similar changes were not made elsewhere. Thus, not confident the above change is the best change.

My working directories on Hera are

  • updat with bufr/12.0.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12
  • contrl with bufr/11.7.0 in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop

global_3dvar ctest results are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/bufr12/tmpreg_global_3dvar

Other ctests have not yet been run with the above change.

@RussTreadon-NOAA
Contributor

Hera ctests with bufr/12.0.0

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  974.84 sec
2/9 Test #5: hwrf_nmm_d3 ......................***Failed  992.72 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1037.49 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  1039.35 sec
5/9 Test #9: global_enkf ......................***Failed  1202.61 sec
6/9 Test #6: rtma .............................***Failed  2002.66 sec
7/9 Test #3: global_4denvar ...................***Failed  2224.99 sec
8/9 Test #2: global_4dvar .....................   Passed  2230.50 sec
9/9 Test #1: global_3dvar .....................***Failed  2392.88 sec

33% tests passed, 6 tests failed out of 9

Total Test time (real) = 2393.03 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          3 - global_4denvar (Failed)
          5 - hwrf_nmm_d3 (Failed)
          6 - rtma (Failed)
          8 - netcdf_fv3_regional (Failed)
          9 - global_enkf (Failed)

All failures except rtma were due to timing checks. Apart from rtma, the bufr/12.0.0 analyses are identical to their bufr/11.7.0 counterparts. For several failed tests, the bufr/12.0.0 executable ran slower than the bufr/11.7.0 executable for one or more of the configurations. For example, global_3dvar

tmpreg_global_3dvar/global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 326.543222
tmpreg_global_3dvar/global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 359.565145
tmpreg_global_3dvar/global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 467.854069
tmpreg_global_3dvar/global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 509.382080

and global_4denvar

tmpreg_global_4denvar/global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 373.331819
tmpreg_global_4denvar/global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 371.220896
tmpreg_global_4denvar/global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 385.547573
tmpreg_global_4denvar/global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 434.971107

The rtma failure shows different initial innovation statistics for temperature report types 197 and 199. There is another call to ufbqcd in read_prepbufr.f90

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',glcd)

As a test, replace glcd with iglcd in the calling list and set glcd to iglcd after the call

 !see if file contains GLERL program code (GLCD)
 !Obtain code if it exists.  Otherwise set to missing (-999)
   call status(lunin,lindx,idummy1,idummy2)
   call nemtab(lindx,'GLERL',idummy1,cdummy,glret)
   if (glret /= 0) then
     call ufbqcd(lunin,'GLERL',iglcd)
     glcd = iglcd

Recompile and rerun rtma test. This time the rtma test passed.

Hera(hfe07):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build$ ctest -R rtma
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/bufr12/build
    Start 6: rtma
1/1 Test #6: rtma .............................   Passed  1278.39 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1278.40 sec

Stop here and let others follow up / investigate more.

@DavidHuber-NOAA
Collaborator Author

Also tagging @jbathegit @edwardhartnett @wx20jjung @dtkleist for awareness.

@arunchawla-NOAA

@DavidHuber-NOAA and @RussTreadon-NOAA thanks for your tests on this! For now I would focus on building and testing with the spack-stack that has bufr/11.7.0. I am going to bring this to the attention of @jbathegit, @jack-woollen, and @edwardhartnett to see if they can throw more light on Russ's results.

@jbathegit

jbathegit commented Jul 21, 2023

It looks like you figured the issue out on your own. This was indeed a planned change within v12.0.0, as noted in the release notes...

  • Subroutines ufbqcd() and ufbqcp() have been modified to pass call arguments containing event program codes as integers, rather than continuing to pass them as real numbers. [Issue #78]

This was done intentionally as part of the simplification down to one library build. Note that in previous versions of ufbqcd, the 3rd argument returned was just a real, not an explicit real*8. So it was only a real*8 in the _8 and _d builds because of the compile flags for those builds, and not because of some intrinsic need within the library itself. The return value of the 3rd argument is never larger than 255 so it fits easily inside of any real, and in fact it's always a whole number so there's really no reason not to return it as an integer and make everything simpler.
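As a hedged illustration of this calling-convention change, the sketch below shows a hypothetical wrapper (not the actual read_prepbufr.f90 code) that receives the program code as an integer and then copies it into the real used downstream:

 ! Hypothetical wrapper illustrating the bufr/12.0.0 ufbqcd interface
 subroutine get_virtmp_code(lunin, vtcd)
   implicit none
   integer, intent(in)  :: lunin    ! BUFR logical unit, already opened via openbf
   real(8), intent(out) :: vtcd     ! downstream comparisons still use a real
   integer :: ivtcd

   call ufbqcd(lunin, 'VIRTMP', ivtcd)   ! v12: third argument is now an integer
   vtcd = ivtcd
 end subroutine get_virtmp_code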

@jack-woollen
Contributor

I feel the need to contribute my two cents in support of backward compatibility with downstream codes as an important principle in software management. Weighing the reasons for and against a change like this, it seems to me there is not a good enough compensating reason to make it.

@jbathegit

jbathegit commented Jul 26, 2023

I hear what you're saying Jack, and I totally agree that backwards compatibility is an important principle in software management. But in this case, this is no different than where we've already agreed that _8 users will now need to make a change to their application codes to add a call to setim8b, and to make sure that all calls to library functions which return integers are now explicitly declared as integer*4.
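For reference, a rough sketch of those documented _8-user adjustments (a hypothetical application; the file name and unit are placeholders, and the exact declarations will depend on the code):

 ! Hypothetical _8 application sketch: declare once that integer arguments are
 ! 8-byte, and declare integer-returning library functions as integer*4.
 program eight_byte_app
   implicit none
   integer(4)       :: ireadmg            ! library functions now return integer*4
   integer(8)       :: lunit, idate       ! application integers are 8-byte
   character(len=8) :: subset

   call setim8b(.true.)                   ! tell the single v12 build to expect 8-byte integers

   lunit = 11_8
   open(unit=11, file='obs.bufr', form='unformatted')   ! hypothetical input file
   call openbf(lunit, 'IN', lunit)
   do while (ireadmg(lunit, subset, idate) == 0)
      ! ... process each message ...
   end do
   call closbf(lunit)
 end program eight_byte_app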

In other words, with a major release such as this (where the X of the version X.Y.Z is changing), it's understood that downstream users may need to make some adjustments to their application codes, and this ufbqcd change is just another example of that, and we've already documented this in the accompanying release notes for _8 and _d users.

Note also that, if we hadn't included this ufbqcd change, then as software managers we'd have had to instead develop new internal routines similar to x48 and x84, but for reals instead of integers, since this routine (along with its sister routine ufbqcp) were the only cases of user-callable routines where a real argument was being passed that wasn't already explicitly declared as real*8. And we couldn't have just changed the library to now make that 3rd argument an explicit real*8, because doing so would have similarly impacted any existing _4 users. So bottom line, no matter what we did here, it would have impacted some of our users, and this is just one of the consequences of our earlier decision to eliminate the previous _8 and _d builds and consolidate everything down to a single _4 build.

@jbathegit

On a separate but related note, I did note the discussion where some of the spack-stack test cases appear to be running slower with the new 12.0.0 build than with the previous 11.7.0 build. That's definitely not a feature that we intentionally introduced into the new build, nor is it something that we saw in our own regression testing for the library, so if it's still an issue of concern then we'd definitely like to better understand what part(s) of the library may be responsible. Is there any way that some sort of profiling tool might better pinpoint where those differences are (e.g. in which library functions or lines of code) for your test cases?

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Sep 14, 2023

I ran regression tests with spack-stack/1.4.1 on Hera against the current develop branch with the following results:

1/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1293.51 sec
2/9 Test #4: hwrf_nmm_d2 ......................***Failed  1592.51 sec
3/9 Test #5: hwrf_nmm_d3 ......................   Passed  1601.61 sec
4/9 Test #9: global_enkf ......................   Passed  1616.37 sec
5/9 Test #8: netcdf_fv3_regional ..............***Failed  1651.17 sec
6/9 Test #6: rtma .............................   Passed  2195.24 sec
7/9 Test #3: global_4denvar ...................***Failed  2551.00 sec
8/9 Test #2: global_4dvar .....................***Failed  2608.58 sec
9/9 Test #1: global_3dvar .....................***Failed  2730.67 sec

Looking into the netcdf_fv3_regional case, differences first appear in initial radiance penalties. This seems like a CRTM issue, though I am having trouble tracking it down.

The CRTM directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl

The CRTM fix directories are
hpc-stack: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix
spack-stack: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0_emc-zvsufvh

The contents of the fix directories are identical. The hpc-stack source code for CRTM is located here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/crtm-v2.4.0, which is identical to the version cloned by spack-stack (github.com/JCSDA/crtm version 2.4.0; hash 5ddd0d6).

So there doesn't seem to be an obvious difference in CRTM. The library versions do differ between the two stacks. In particular, spack-stack uses hdf5/1.14.0 and netcdf/4.9.2, its nemsio is still at 2.5.2, and it includes w3nco, but I don't think these should have an impact.

hpc-stack

  1) miniconda3/4.12.0    9) netcdf/4.7.4  17) nemsio/2.5.4
  2) hpc/1.2.0           10) bufr/11.7.0   18) wrf_io/1.2.0
  3) intel/2022.1.2      11) bacio/2.4.1   19) ncio/1.1.2
  4) hpc-intel/2022.1.2  12) w3emc/2.9.2   20) crtm/2.4.0
  5) impi/2022.1.2       13) sp/2.3.3      21) ncdiag/1.1.1
  6) hpc-impi/2022.1.2   14) ip/3.3.3      22) gsi_common
  7) cmake/3.20.1        15) sigio/2.3.2   23) prod_util/1.2.2
  8) hdf5/1.10.6         16) sfcio/1.4.1   24) gsi_hera.intel

spack-stack

  1) intel/2022.1.2                   12) zstd/1.5.2            23) nemsio/2.5.2
  2) stack-intel/2021.5.0             13) netcdf-c/4.9.2        24) wrf-io/1.2.0
  3) impi/2022.1.2                    14) netcdf-fortran/4.6.0  25) ncio/1.1.2
  4) stack-intel-oneapi-mpi/2021.5.1  15) bufr/11.7.0           26) crtm-fix/2.4.0_emc
  5) miniconda/3.9.12                 16) bacio/2.4.1           27) git-lfs/2.10.0
  6) stack-python/3.9.12              17) w3emc/2.9.2           28) crtm/2.4.0
  7) cmake/3.23.1                     18) sp/2.3.3              29) gsi-ncdiag/1.1.1
  8) curl/7.29.0                      19) ip/3.3.3              30) gsi_common
  9) pkg-config/0.27.1                20) sigio/2.3.2           31) prod-util/1.2.2
 10) zlib/1.2.13                      21) sfcio/1.4.1           32) gsi_hera.intel
 11) hdf5/1.14.0                      22) w3nco/2.4.1

@RussTreadon-NOAA Would you mind taking a look at the RTs on Hera? My regression test directory is in /scratch1/NCEPDEV/nems/David.Huber/GSI/gsi_spackstack/build/regression and the working directories are in /scratch1/NCEPDEV/stmp/David.Huber/ptmp_spack.

@RussTreadon-NOAA
Contributor

A check of the run directories shows the crtm coefficients to be identical between the updat and contrl runs.

A check of fort.207 shows initial radiance penalty differences between the updat and contrl for the following satellite_sensors: amsua_n15, amsua_n18, amsua_n19, atms_npp, atms_n20. These are all microwave sensors. However, other microwave satellite_sensors have identical initial penalties between the updat and contrl (e.g., mhs_n19, amsua_metop-b, amsua_metop-c, mhs_metop-c).

Are the same compiler flags used to build the hpc-stack and spack-stack crtm libraries?

@DavidHuber-NOAA
Collaborator Author

@AlexanderRichert-NOAA Could you point me to the build log for spack-stack/1.4.1 CRTM/2.4.0 on Hera? I'm seeing some differences between it and the hpc-stack version.

@AlexanderRichert-NOAA
Contributor

Sure-- /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl/.spack/spack-build-out.txt (and there's other build-related info in that directory)

@AlexanderRichert-NOAA
Contributor

I won't look at it any further unless you'd like me to, but for what it's worth the difference that stands out to me is the CMAKE_BUILD_TYPE (hpc-stack=Release, spack-stack=RelWithDebInfo), and we've seen some differences in other codes based on that setting.

@jack-woollen
Contributor

@jbathegit @edwardhartnett I just ran a simple test on wc2 reading a large satellite file, which is basically all the gsi does with bufrlib, and got much the same result David did in his tests. With bl11 it took 2m5s, with bl12 2m45s. Timing changes a little from run to run, but it generally takes something like 40s longer with bl12. My test literally just read the data using ireadmg and ireadsb then stopped. I'll proceed to comment out the im8b test blocks for the routines involved in the read process and see what happens. The fix may just be something like separate paths through the lower-level routines for the 4- and 8-byte cases. Depends on how many routines that turns out to be.
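For reference, a minimal read-only driver of the kind described might look like the sketch below (hypothetical file name and unit number; this is not the actual test code):

 ! Minimal read-only BUFR timing driver: walk every message and subset, then stop.
 program read_timing
   implicit none
   integer, parameter :: lunit = 11
   character(len=8)   :: subset
   integer :: idate, nmsg, nsub
   integer :: ireadmg, ireadsb          ! NCEPLIBS-bufr functions

   open(lunit, file='mtiasi.bufr', form='unformatted')   ! hypothetical file name
   call openbf(lunit, 'IN', lunit)

   nmsg = 0
   nsub = 0
   do while (ireadmg(lunit, subset, idate) == 0)
      nmsg = nmsg + 1
      do while (ireadsb(lunit) == 0)
         nsub = nsub + 1                ! decode only; no ufbint/ufbrep calls
      end do
   end do

   call closbf(lunit)
   print *, 'messages read:', nmsg, '  subsets read:', nsub
 end program read_timing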

@jack-woollen
Contributor

jack-woollen commented Oct 29, 2023

Moving the tests to hera, using gfortran, v12 was actually a little bit faster than v11. So how do you compile the bufrlib with ifort on hera?

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Hmm, interesting. I compiled version 12 of bufrlib two ways -- CMAKE_BUILD_TYPE="Release" (case 2 here) and CMAKE_BUILD_TYPE="RelWithDebInfo" (case 1). Both are slower than version 11.7.0 with CMAKE_BUILD_TYPE="Release" (case ctl).

@edwardhartnett

Can this problem be demonstrated in a bufrlib test program?

That is, can we get a simple, one-file program, which runs slower under v12 than v11?

@DavidHuber-NOAA
Collaborator Author

@edwardhartnett It sounded like Jack had done just that in this comment.

@jack-woollen
Contributor

Timing these codes can be fiddly, depending on what platform they are tested on, and when. WCOSS2 has demonstrated widely varying timings via the unix time command. At first the results looked like v12 was slower than v11. That's when I wrote the comment above. But when I ran it over and over and over, the results became more murky. Then on Sunday I tried timing on hera, where v12 was a couple seconds faster than v11, but compiling with gfortran. Then I went back to WC2 (and ifort) and got results like hera, v12 faster by a hair.

So far my simple timing test only checks reading a large, compressed (mtiasi) file. I'm thinking running the GSI observer with a full set of data will give a better comparison for that code. Working on it now. The elephant in the room for bufrlib timing may be the prepobs process, since it does reading, writing, and arithmetic. I'll look at that also.

My question about compiling bufrlib on hera with ifort refers to the fact that when I download bufrlib from git, the cmake default is to use gfortran. I'm wondering what to change so it compiles with ifort. Any help with that is appreciated.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Oct 30, 2023

@jack-woollen Ah, understood. To compile with Intel on Hera, you only need to load the intel/2022.1.2 module and then run cmake. CMake will detect the Intel compiler and generate the makefiles accordingly.
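For reference, one possible command sequence (an assumption on my part, not a verified recipe) that makes the compiler choice explicit:

 module load intel/2022.1.2
 cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_BUILD_TYPE=Release /path/to/NCEPLIBS-bufr
 make -j4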

@jack-woollen
Contributor

jack-woollen commented Nov 1, 2023

Thanks to @DavidHuber-NOAA's help with the gsi cmake setup, I have been able to make timing comparisons of running the gsi observer mode with BF-11.7.0 and BF-12.0.0. The observer reads all the BUFR datasets into the program and then stops if the variable miter=0. So it exercises most if not all of the bufr coding within the gsi, and it only takes a few minutes to complete. Some experimentation has revealed what looks like roughly half of the timing issue apparent in the gsi runs using bufrlib version 12. Subroutine upb8.f, which is called when unpacking compressed bufr datasets, benefited from some optimization specifically for the 4-byte case. This subroutine was not introduced as part of the mods for the single library build, but rather to accommodate the new WMO-approved feature of allowing numeric elements to fill up to 64 bits, instead of just 32 as it has been until now. The fix for upb8.f can be reviewed on dogwood in directory /lfs/h2/emc/global/noscrub/Jack.Woollen/bufrtime/bufr_v12.0.0/NCEPLIBS-bufr/src.
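To make the idea concrete, here is a self-contained sketch of the kind of nbits<=32 fast path being described (illustrative only; it is not the actual upb8.f source):

 ! Illustrative fast-path sketch: extract the low nbits bits of a packed value
 ! with 32-bit arithmetic in the common case, 64-bit only for wide elements.
 program fastpath_demo
   implicit none
   integer(8) :: packed, value
   integer    :: nbits

   packed = 48879_8          ! hypothetical packed value (hex BEEF)
   nbits  = 12

   if (nbits <= 32) then
      ! common 4-byte case: 32-bit extraction, then widen to 8 bytes
      value = int(ibits(int(packed, 4), 0, nbits), 8)
   else
      ! rare wide-element case allowed in BUFR v12: 64-bit extraction
      value = ibits(packed, 0, nbits)
   end if

   print '(a,i0,a,i0)', 'low ', nbits, ' bits = ', value   ! prints 3823
 end program fastpath_demo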

Following is a list of timing results before and after the mods to upb8.

Before upb8 change

The total amount of time in user mode = 268.026405 ~40s
The total amount of time in user mode = 308.623302

The total amount of time in user mode = 267.477490 ~42s
The total amount of time in user mode = 309.731031

The total amount of time in user mode = 267.551686 ~43s
The total amount of time in user mode = 310.306406

The total amount of time in user mode = 266.931843 ~50s
The total amount of time in user mode = 316.532808

The total amount of time in user mode = 268.897811 ~47s
The total amount of time in user mode = 315.577455

After upb8 change

The total amount of time in user mode = 268.612845 ~22s
The total amount of time in user mode = 290.555594

The total amount of time in user mode = 274.917766 ~14s
The total amount of time in user mode = 288.483719

The total amount of time in user mode = 265.291353 ~25s
The total amount of time in user mode = 290.776556

The total amount of time in user mode = 265.125857 ~25s
The total amount of time in user mode = 290.373777

The total amount of time in user mode = 267.488541 ~23s
The total amount of time in user mode = 290.313897

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Fantastic! Thanks for the update. I will run this optimization through the GSI regression tests.

@jbathegit

@jack-woollen I also have to add my kudos - well done sir! Your change to add an nbits<=32 block to optimize performance for the 4-byte majority of cases makes perfect sense, and it looks like David's preliminary tests with your update are having the intended effect as well. Thanks again!

I'll go ahead and set up a PR to add this to the library baseline for the next release, and I'll let you know when it's ready.

@jbathegit

BTW, and just for my own FYI, how were you doing your timings? Were you using calls to the intrinsic routines cpu_time and system_clock inside of the Fortran source code, or were you using something like the UNIX time command inside of a shell script? As you know, I've started working to add some timing tests to the library CI suite, so I'd like to replicate it using whatever best practices are being used by others for such testing in GitHub or elsewhere.

Or maybe this is more of a question for @DavidHuber-NOAA or others?
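For what it's worth, a minimal sketch of the two intrinsics mentioned above (generic Fortran; not taken from the GSI or the library CI suite):

 ! cpu_time gives CPU seconds; system_clock gives wall-clock ticks.
 program timing_demo
   implicit none
   real       :: t0, t1, x
   integer(8) :: c0, c1, rate
   integer    :: i

   call cpu_time(t0)
   call system_clock(c0, rate)

   x = 0.0
   do i = 1, 10000000            ! stand-in for the work being timed
      x = x + sin(real(i))
   end do

   call cpu_time(t1)
   call system_clock(c1)

   print *, 'CPU time  (s):', t1 - t0
   print *, 'Wall time (s):', real(c1 - c0) / real(rate)
   print *, x                    ! keep the loop from being optimized away
 end program timing_demo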

@jack-woollen
Contributor

@jbathegit For the gsi there is the timing printed from the w3 routine, I think. For other tests I use the unix time command. In either case the user time is more stable than the wall time for comparisons.

@jbathegit

Thanks Jack.

Do you happen to know which w3 routine they're using? Maybe w3utcdat?

And by "user time", I presume you mean CPU time?

@jack-woollen
Contributor

Jeff, I don't know what timer is used in gsi. It isn't w3utcdat, at least I can't find that in the code. Whatever it is, it reports like this:

RESOURCE STATISTICS**************
The total amount of wall time = 265.619982
The total amount of time in user mode = 292.463298
The total amount of time in sys mode = 5.330310
The maximum resident set size (KB) = 1827508
Number of page faults without I/O activity = 171892
Number of page faults with I/O activity = 24
Number of times filesystem performed INPUT = 1938274
Number of times filesystem performed OUTPUT = 121176
Number of Voluntary Context Switches = 6155
Number of InVoluntary Context Switches = 417
END OF RESOURCE STATISTICS********

Of course the unix timer reports:

real 0m0.012s
user 0m0.007s
sys 0m0.005s

In either case I think the user time is more stable than wall time. It is probably more like cpu time.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Nov 8, 2023

I ran regression tests against spack-stack/1.5.1 on Orion. Multiple tests failed their timethresh checks. Following the suggestion in NOAA-EMC/global-workflow#2044, I tried adding --cpus-per-task to the srun call, but this did not help. Additionally, the contrl tests run significantly faster (in some cases >2x faster) than the updat. For example, the timings for hafs_3denvar_hybens were

hafs_3denvar_hybens_loproc_contrl   05:55
hafs_3denvar_hybens_loproc_updat   12:31

The global_enkf and hafs_4denvar_glbens tests showed similar slowdowns.

I then ran regression tests on Hera using spack-stack/1.5.1. No tests failed for timethresh exceedances. It seems that the slowdowns are limited to Orion and only to spack-stack/1.5.1.

However, 4 tests failed for non-reproducible cost function results on both systems (netcdf_fv3_regional, hafs_3denvar_hybens, hafs_4denvar_glbens, and global_4denvar). These will need to be investigated.

@jack-woollen
Contributor

jack-woollen commented Nov 9, 2023

@DavidHuber-NOAA
Something obviously not right here. Have you replicated the last round of regression tests on hera? Just curious.

FYI, I copied the latest souped-up test version of bufr_v12 over to hera and compiled it with intel/2022.1.2, which is consistent with the bufr/11.7.0 module, and ran the time test reading through an 827MB mtiasi file. The results are below. They are a little better than I see on wcoss2. The updated v12 bufr working set is found on hera in /scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0.

/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-2022.1.2/bufr/11.7.0/lib64/libbufr_4.a
real 2m0.061s
user 1m59.297s
sys 0m0.275s

/scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0/build/path1/lib64/libbufr_4.a
real 2m1.337s
user 2m0.621s
sys 0m0.310s

@DavidHuber-NOAA
Collaborator Author

@jack-woollen @AlexanderRichert-NOAA Those are some very promising times. I will give that a try in the regression tests on Hera later today.

And yes, I tried these tests out on Hera as well with spack-stack/1.5.1. There isn't a long runtime issue on that system, but there are reproducibility issues. I looked into it this morning and it appears that CRTM is still being built with CMAKE_BUILD_TYPE=RelWithDebInfo (/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-tbbqmsa/.spack/spack-build-01-cmake-out.txt). Alex, I thought that the default build type was supposed to change to Release; is that not the case?

@DavidHuber-NOAA
Collaborator Author

I recompiled both develop and spack-stack on Orion and reran the regression tests this morning. The time results were much improved, though somewhat erratic. Perhaps I forgot to recompile after updating the modulefiles yesterday, or perhaps there is a system issue. That said, four tests still failed their timethresh checks: hafs_3denvar_hybens, hafs_4denvar_glbens, global_enkf, and rrfs_3denvar_glbens. The HAFS tests seem to be the most affected. Below are all of the timings on Orion:

test hi/lo proc branch time
global_4denvar hiproc develop 283.738921
global_4denvar hiproc spack-stack 291.758573
global_4denvar loproc develop 391.136941
global_4denvar loproc spack-stack 400.481576
global_enkf hiproc develop 65.608062
global_enkf hiproc spack-stack 74.296133
global_enkf loproc develop 108.938672
global_enkf loproc spack-stack 91.795480
hafs_3denvar_hybens hiproc develop 259.064089
hafs_3denvar_hybens hiproc spack-stack 255.735360
hafs_3denvar_hybens loproc develop 315.284807
hafs_3denvar_hybens loproc spack-stack 564.090202
hafs_4denvar_glbens hiproc develop 273.772581
hafs_4denvar_glbens hiproc spack-stack 291.219525
hafs_4denvar_glbens loproc develop 404.680337
hafs_4denvar_glbens loproc spack-stack 539.146473
netcdf_fv3_regional hiproc develop 69.890257
netcdf_fv3_regional hiproc spack-stack 64.494140
netcdf_fv3_regional loproc develop 73.620493
netcdf_fv3_regional loproc spack-stack 79.163477
rrfs_3denvar_glbens hiproc develop 78.840823
rrfs_3denvar_glbens hiproc spack-stack 122.124053
rrfs_3denvar_glbens loproc develop 87.458248
rrfs_3denvar_glbens loproc spack-stack 110.392840
rtma hiproc develop 223.007984
rtma hiproc spack-stack 236.367820
rtma loproc develop 237.599617
rtma loproc spack-stack 213.418943

For comparison, below are the same runtimes on Hera. The runtimes are obviously slower for spack-stack on Hera (I'm guessing due to optimization settings), but not to the same extent as Orion.

test hi/lo proc branch time
global_4denvar hiproc develop 290.166800
global_4denvar hiproc spack-stack 308.381291
global_4denvar loproc develop 359.552020
global_4denvar loproc spack-stack 378.906640
global_enkf hiproc develop 60.475895
global_enkf hiproc spack-stack 62.403168
global_enkf loproc develop 82.992120
global_enkf loproc spack-stack 91.201193
hafs_3denvar_hybens hiproc develop 203.225758
hafs_3denvar_hybens hiproc spack-stack 225.581660
hafs_3denvar_hybens loproc develop 289.802607
hafs_3denvar_hybens loproc spack-stack 308.264200
hafs_4denvar_glbens hiproc develop 246.582969
hafs_4denvar_glbens hiproc spack-stack 258.118368
hafs_4denvar_glbens loproc develop 341.492453
hafs_4denvar_glbens loproc spack-stack 351.850318
netcdf_fv3_regional hiproc develop 55.377637
netcdf_fv3_regional hiproc spack-stack 58.965612
netcdf_fv3_regional loproc develop 66.615093
netcdf_fv3_regional loproc spack-stack 72.121535
rrfs_3denvar_glbens hiproc develop 64.051000
rrfs_3denvar_glbens hiproc spack-stack 64.975927
rrfs_3denvar_glbens loproc develop 80.114589
rrfs_3denvar_glbens loproc spack-stack 115.086221
rtma hiproc develop 188.083975
rtma hiproc spack-stack 190.591114
rtma loproc develop 199.716769
rtma loproc spack-stack 198.905565

I'm going to double check that I have all of the right libraries and then try rerunning the tests on Orion on /work2 just to see if that makes a difference.

@jack-woollen
Contributor

Here's a summary of the latest results as shown above. The orion side has some non-bufr issues for sure. Given the latest optimization of the bufr code, and if I understand correctly that these tests are only variations of running the gsi executable, where the bufr data is read just once and done, then any difference between develop and spack-stack of more than a very few seconds is due to something other than the bufrlib. That might include i/o or cpu bottlenecks or other hardware or software factors which could be in play.

test hi/lo proc branch hera time hera dev-stk orion dev-stk orion time
global_4denvar hiproc develop 290.17     283.74
global_4denvar hiproc spack-stack 308.38 18.21 8.02 291.76
global_4denvar loproc develop 359.55     391.14
global_4denvar loproc spack-stack 378.91 19.35 9.34 400.48
global_enkf hiproc develop 60.48     65.61
global_enkf hiproc spack-stack 62.40 1.93 8.69 74.30
global_enkf loproc develop 82.99     108.94
global_enkf loproc spack-stack 91.20 8.21 -17.14 91.80
hafs_3denvar_hybens hiproc develop 203.23     259.06
hafs_3denvar_hybens hiproc spack-stack 225.58 22.36 -3.33 255.74
hafs_3denvar_hybens loproc develop 289.80     315.28
hafs_3denvar_hybens loproc spack-stack 308.26 18.46 248.81 564.09
hafs_4denvar_glbens hiproc develop 246.58     273.77
hafs_4denvar_glbens hiproc spack-stack 258.12 11.54 17.45 291.22
hafs_4denvar_glbens loproc develop 341.49     404.68
hafs_4denvar_glbens loproc spack-stack 351.85 10.36 134.47 539.15
netcdf_fv3_regional hiproc develop 55.38     69.89
netcdf_fv3_regional hiproc spack-stack 58.97 3.59 -5.40 64.49
netcdf_fv3_regional loproc develop 66.62     73.62
netcdf_fv3_regional loproc spack-stack 72.12 5.51 5.54 79.16
rrfs_3denvar_glbens hiproc develop 64.05     78.84
rrfs_3denvar_glbens hiproc spack-stack 64.98 0.92 43.28 122.12
rrfs_3denvar_glbens loproc develop 80.11     87.46
rrfs_3denvar_glbens loproc spack-stack 115.09 34.97 22.93 110.39
rtma hiproc develop 188.08     223.01
rtma hiproc spack-stack 190.59 2.51 13.36 236.37
rtma loproc develop 199.72     237.60
rtma loproc spack-stack 198.91 -0.81 -24.18 213.42

@DavidHuber-NOAA
Collaborator Author

@jack-woollen Agreed that these differences are not BUFR-related (I'm trying to keep the BUFR-related tests in #642). I believe the differences on Hera are due to optimization differences of the libraries between develop (hpc-stack) and spack-stack. Looking into this some more, the only GSI-used library compiled with RelWithDebInfo appears to be CRTM, so the time differences are likely related to that library alone. This is being tracked in JCSDA/spack-stack#827.

The differences on Orion are something more, and I think I/O may be a significant factor. The /work filesystem has always been slow, and perhaps the effect of spack-stack's lower optimization is magnified when additional disk I/O is involved. Indeed, I just finished a test on /work2 and the very large time differences in the HAFS tests go away. There is still some difference, but it is much smaller:

test hi/lo proc branch time
global_4denvar hiproc develop 275.825299
global_4denvar hiproc spack-stack 314.997323
global_4denvar loproc develop 371.205687
global_4denvar loproc spack-stack 395.405433
global_enkf hiproc develop 65.735713
global_enkf hiproc spack-stack 68.736429
global_enkf loproc develop 83.469664
global_enkf loproc spack-stack 86.675028
hafs_3denvar_hybens hiproc develop 254.080952
hafs_3denvar_hybens hiproc spack-stack 277.932448
hafs_3denvar_hybens loproc develop 336.877198
hafs_3denvar_hybens loproc spack-stack 377.288445
hafs_4denvar_glbens hiproc develop 297.089561
hafs_4denvar_glbens hiproc spack-stack 310.065488
hafs_4denvar_glbens loproc develop 398.269650
hafs_4denvar_glbens loproc spack-stack 429.657761
netcdf_fv3_regional hiproc develop 61.574793
netcdf_fv3_regional hiproc spack-stack 74.771153
netcdf_fv3_regional loproc develop 72.966334
netcdf_fv3_regional loproc spack-stack 87.934112
rrfs_3denvar_glbens hiproc develop 97.207300
rrfs_3denvar_glbens hiproc spack-stack 102.607774
rrfs_3denvar_glbens loproc develop 93.075213
rrfs_3denvar_glbens loproc spack-stack 125.331071
rtma hiproc develop 206.374645
rtma hiproc spack-stack 206.658501
rtma loproc develop 216.188268
rtma loproc spack-stack 216.373627

I think I am satisfied with these timings on Orion. I'm going to move on to Hercules, Jet, and S4.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 9, 2023
@DavidHuber-NOAA
Collaborator Author

I built this branch on S4. I won't be able to test on that system until the global workflow is updated, since the regression test dataset contains restricted data, which is not allowed on S4.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 13, 2023
This includes netcdf/4.9.2 and hdf5/1.14.0.  However, it does not include
the upgrades in spack-stack on the other systems for ncdiag/1.1.2,
ip/4.3.0, and w3emc/2.10.0.
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Nov 14, 2023