
[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. #37

Closed
joeylamcy opened this issue Aug 29, 2020 · 61 comments


@joeylamcy

joeylamcy commented Aug 29, 2020

Hi everyone,

I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour with forrtl: error (73): floating divide by zero. The full log files are attached below.
163214_print_out.log
163214_error.log

More information:

  • Intel MPI with the Intel 18 compiler
  • ESMF 8.0.0 public release built with ESMF_COMM=intelmpi

I'm not sure how to troubleshoot this issue. I tried building with cmake -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerunning the simulation, but it produces a really large error log file, so I'm not attaching it here. The first few lines of the error log are:

forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               00000000094F8D62  Unknown               Unknown  Unknown
geos               00000000094F6232  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos               0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos               0000000007F00B39  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               000000000844804D  Unknown               Unknown  Unknown
geos               0000000007EE2A0F  Unknown               Unknown  Unknown
geos               0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos               0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos               0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos               0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos               0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos               00000000004242FF  MAIN__                     29  GCHPctm.F90
geos               000000000042125E  Unknown               Unknown  Unknown
geos               000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBC9F34505  __libc_start_main     Unknown  Unknown
geos               0000000000421169  Unknown               Unknown  Unknown

I also noticed something weird towards the start of the run:

      MAPL: No configure file specified for logging layer.  Using defaults. 
     SHMEM: NumCores per Node = 6
     SHMEM: NumNodes in use   = 1
     SHMEM: Total PEs         = 6
     SHMEM: NumNodes in use  = 1

Previous versions (e.g. 12.8.2) usually show this instead:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6


 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1

but I'm not sure if that matters.

@sdeastham
Contributor

I suspect that the error in AdvCore_GridCompMod is misleading, but it is something we should fix. In AdvCore_GridCompMod.F90, ntracers is set to 11 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86). However, this leads to a formatting error, because a later loop over N = 1, ntracers writes N-1 into a string using a single-digit integer format, which overflows once N-1 reaches 10 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L260-L269). The fix for that particular error is obvious - just set ntracers to 10 (ntracers doesn't seem to be a particularly important variable, and is only used to define these "test outputs").

I found that setting ntracers=10 does fix this error and allows you to find whatever the REAL error is. @lizziel we should raise this with GMAO and kick a pull request up the chain!

@joeylamcy
Author

Oh, right. I checked the log for the debug run again, and the run actually ended much earlier than without the debug flag, so I suppose using -DCMAKE_BUILD_TYPE=Debug doesn't help me here.

@sdeastham
Contributor

sdeastham commented Aug 31, 2020

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue", I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.

@lizziel
Contributor

lizziel commented Aug 31, 2020

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

@lizziel
Contributor

lizziel commented Aug 31, 2020

Regarding the debug flags issue, I created an issue on GEOS-ESM/FVdycoreCubed_GridComp: GEOS-ESM/FVdycoreCubed_GridComp#71.

@joeylamcy
Author

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

Yes. If all collections in HISTORY.rc are commented out, the run continues smoothly. But turning on any number of the collections causes the problem, i.e. it's not specific to any one collection.

@joeylamcy
Author

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue", I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.

Actually I tried it, but there are some further issues. The printout is still stuck at

NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
 ncnst=           0  num_prog=           0  pnats=           0  dnats=
           0  num_family=           0
 
 Grid distance at face edge (km)=   163384.217664128     

and the error log grows at a rate of ~100 MB/min for at least 5 minutes, so I just manually stopped the run. The leading error is still in AdvCore_GridCompMod.F90.

forrtl: warning (406): fort: (1): In call to MPI_GROUP_INCL, an array temporary was created for argument #3

Image              PC                Routine            Line        Source
geos.debug         00000000094A5440  Unknown               Unknown  Unknown
geos.debug         0000000005A6D950  mpp_mod_mp_get_pe         109  mpp_util_mpi.inc
geos.debug         0000000005A9EE8F  mpp_mod_mp_mpp_in          55  mpp_comm_mpi.inc
geos.debug         0000000004A0B3A6  fms_mod_mp_fms_in         342  fms.F90
geos.debug         000000000226E3A3  advcore_gridcompm         311  AdvCore_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos.debug         0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos.debug         0000000007F00B39  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         000000000844804D  Unknown               Unknown  Unknown
geos.debug         0000000007EE2A0F  Unknown               Unknown  Unknown
geos.debug         0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos.debug         0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos.debug         0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos.debug         0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos.debug         0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos.debug         00000000004242FF  MAIN__                     29  GCHPctm.F90
geos.debug         000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002B9C8AD6A505  __libc_start_main     Unknown  Unknown
geos.debug         0000000000421169  Unknown               Unknown  Unknown

@sdeastham
Contributor

The array temporary warnings are irrelevant - given enough time, the code should still reach the actual error - but I agree that it's not helpful to have them padding the error log. They also slow the run down considerably, so although the printout appears stuck, it should eventually clear.

@LiamBindle - can you recommend a preferred way to suppress array temporary warnings in FV3 using CMake? I can imagine that one could do this by editing the contents of ESMA_cmake, but that seems non-ideal.

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

  1. Can you verify that your copy of MAPL_HistoryGridComp.F90 also has call o_Clients%done_collective_stage() on line 3570? That will give us a thread to tug on with GMAO.
  2. Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.
  3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4)?

@lizziel
Contributor

lizziel commented Sep 1, 2020

I noticed you are running a c48 standard simulation with 6 cores and 3 GB per core across 1 node, if the log file prints are to be trusted. It surprises me that the simulation ran without running out of memory. You can try upping your resources and lowering your resolution to c24 to see if that makes a difference at all for the diagnostics.

Also try commenting out individual collections to see if there is a specific history collection consistently causing the problem.

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 hours (or duration to 1 hour and your run start/end/duration to 1 hour as well) in runConfig.sh and see if that changes anything.

@LiamBindle
Contributor

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this more cleanly, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warnings, @joeylamcy!

@lizziel
Contributor

lizziel commented Sep 1, 2020

I am going to put this update into the GCHPctm 13.0.0-alpha.10 pre-release.

@LiamBindle
Contributor

@lizziel I think you can do "SHELL:-check noarg_temp_created" to get it to work for ifort 18 and 19, if ifort 19 doesn't like the comma.

@lizziel
Contributor

lizziel commented Sep 1, 2020

Following up about the original issue, we have another report of a similar divide by zero floating point error while writing diagnostics:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
geos               0000000001FBCA6F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AD70A4335D0  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE6A14  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE4B4B  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.0  00002AD706A10B5B  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00002AD706524DB4  nf_def_var_           Unknown  Unknown
geos               0000000001B0E765  m_netcdf_io_defin         218  m_netcdf_io_define.F90
geos               0000000001B62855  ncdf_mod_mp_nc_va        3866  ncdf_mod.F90
geos               000000000187B19E  history_netcdf_mo         465  history_netcdf_mod.F90
geos               0000000001876EB6  history_mod_mp_hi        2925  history_mod.F90
geos               0000000000412C17  MAIN__                   2076  main.F90
geos               000000000040C4DE  Unknown               Unknown  Unknown
libc-2.17.so       00002AD70A8663D5  __libc_start_main     Unknown  Unknown
geos               000000000040C3E9  Unknown               Unknown  Unknown

This was using an older version of GCHPctm. It was fixed by switching to a different set of libraries, including netcdf. Try honing in on @sdeastham's suggestion:

Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

@joeylamcy
Author

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

1. Can you verify that your copy of `MAPL_HistoryGridComp.F90` also has `call o_Clients%done_collective_stage()` on line 3570? That will give us a thread to tug on with GMAO.

Yes.

2. Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ifort --version
ifort (IFORT) 18.0.2 20180210
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nc-config --all

This netCDF 4.6.1 has been built with the following features: 

  --cc        -> icc
  --cflags    -> -I/opt/share/netcdf-4.6.1/include 
  --libs      -> -L/opt/share/netcdf-4.6.1/lib -lnetcdf

  --has-c++   -> no
  --cxx       -> 

  --has-c++4  -> no
  --cxx4      -> 

  --has-fortran-> yes
  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf
  --has-f90   -> no
  --has-f03   -> yes

  --has-dap   -> yes
  --has-dap4  -> yes
  --has-nc2   -> yes
  --has-nc4   -> yes
  --has-hdf5  -> yes
  --has-hdf4  -> no
  --has-logging-> no
  --has-pnetcdf-> no
  --has-szlib -> no
  --has-parallel -> no
  --has-cdf5 -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --libdir    -> /opt/share/netcdf-4.6.1/lib
  --version   -> netCDF 4.6.1

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nf-config --all

This netCDF-Fortran 4.4.4 has been built with the following features: 

  --cc        -> icc
  --cflags    ->  -I/opt/share/netcdf-4.6.1/include 

  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf 
  --has-f90   -> no
  --has-f03   -> yes

  --has-nc2   -> yes
  --has-nc4   -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --version   -> netCDF-Fortran 4.4.4

Side note: during cmake .., there is an error saying that HDF5 is missing, so I manually exported CMAKE_PREFIX_PATH=/opt/share/hdf5-1.10.2. If this matters, let me know and I can try reproducing it.

3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run `ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4`)?

HDF errors, and the file sizes are obviously not right either. Meanwhile, ncdump-ing the MERRA-2 input data works fine.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.DryDep.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.DryDep.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ls -lh OutputDir/
total 12K
-rw-r--r--. 1 s1155064480 AmosTai 23 Aug 27 20:09 FILLER
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.DryDep.20160701_0030z.nc4
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.SpeciesConc.20160701_0030z.nc4

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.

I changed duration to 1-hr and run start/end/duration to 1-hr as well, but the same floating divide by zero error occurs.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

I didn't try other alpha versions, but we used these libraries when building GCHP 12.8.2.

@joeylamcy
Author

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this cleaner, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warning @joeylamcy!

Yep, it works and only a few warnings are left before the error messages. Now I get the same errors as @lizziel did in GEOS-ESM/ESMA_cmake#125 (comment)

@lizziel
Contributor

lizziel commented Sep 2, 2020

I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.

@sdeastham
Contributor

I'm very suspicious about the issue with HDF5 during cmake; @LiamBindle, any thoughts?

@joeylamcy
Author

I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.

I'm still getting those errors (from GetPointer.H and MAPL_Generic.F90) even after the fix, though.

@lizziel
Contributor

lizziel commented Sep 2, 2020

Did you also move the conditional for N <= ntracers? That solved it for me. Regardless, I now get past advection and am getting a new error in History. This is a problem in the GMAO MAPL library. I think it is safe to say that running GCHP with debug flags is not yet ready. I am working with GMAO to get fixes for the bugs I am finding into their code.

I agree with @sdeastham that the focus for your issue should be on the netcdf/HDF5 library. Could you post your environment file, CMakeCache.txt, CMakeFiles/CMakeError.log, and CMakeFiles/CMakeOutput.log?

We also now have documentation on how to build the libraries for GCHPctm with Spack. We are looking for beta users to try it out; are you interested? It may solve the issue.

@LiamBindle
Contributor

I'm a bit surprised it didn't pick up HDF5 automatically considering it picked up NetCDF automatically, but @joeylamcy did the correct thing in pointing CMake to the appropriate HDF5 library with CMAKE_PREFIX_PATH.

The fact that it's crashing in a nc_def_var call (deep in HISTORY) after writing 96 bytes suggests, to me, that it's something obscure to do with NetCDF. The fact that the simulation runs okay when output collections are turned off supports that too. It looks like the checkpoint file is being written okay, so it isn't consistent. I would agree with the suggestions to

  1. Try a different version of NetCDF/NetCDF-Fortran
  2. Increase the resources (I'm surprised 18 GB is enough for C48)

@joeylamcy
Author

gchp_13.0.0.env.txt
CMakeCache.txt
CMakeError.log
CMakeOutput.log

netcdf-4.6.1 is sourced upon login. The shell script is as follows:

export NETCDF=/opt/share/netcdf-4.6.1
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include/:$INCLUDE

@joeylamcy
Author

We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.

It looks promising. Does Spack need root access?

@WilliamDowns
Contributor

Spack does not require root access. Those instructions should be fine for getting set up with OpenMPI and GNU compilers; Intel MPI and/or Intel compilers also work but require a bit more setup that we haven't written out yet on the Wiki. You also won't need to manually define as many environment variables when loading NetCDF through Spack or other package managers. I've pasted a working environment file below (change SPACK_ROOT and ESMF_DIR as needed):

spack unload
export SPACK_ROOT=/path/to/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load emacs
#==============================================================================
# %%%%% Load Spackages %%%%%
#==============================================================================
spack load gcc@9.3.0
spack load git%gcc@9.3.0
spack load cmake%gcc@9.3.0
spack load openmpi%gcc@9.3.0
spack load netcdf-fortran%gcc@9.3.0^openmpi

export MPI_ROOT=$(spack location -i openmpi)

# Make all files world-readable by default
umask 022

# Specify compilers
export CC=gcc
export CXX=g++
export FC=gfortran

# For ESMF
export ESMF_COMPILER=gfortran
export ESMF_COMM=openmpi
export ESMF_DIR=/path/to/ESMF
export ESMF_INSTALL_PREFIX=${ESMF_DIR}/INSTALL_openmpi_gfortran93
# For GCHP
export ESMF_ROOT=${ESMF_INSTALL_PREFIX}

#==============================================================================
# Set limits
#==============================================================================

#ulimit -c 0                      # coredumpsize
export OMP_STACKSIZE=500m
ulimit -l unlimited              # memorylocked
ulimit -u 50000                  # maxproc
ulimit -v unlimited              # vmemoryuse
ulimit -s unlimited              # stacksize

#==============================================================================
# Print information
#==============================================================================

#module list
echo ""
echo "Environment:"
echo ""
echo "CC: ${CC}"
echo "CXX: ${CXX}"
echo "FC: ${FC}"
echo "ESMF_COMM: ${ESMF_COMM}"
echo "ESMF_COMPILER: ${ESMF_COMPILER}"
echo "ESMF_DIR: ${ESMF_DIR}"
echo "ESMF_INSTALL_PREFIX: ${ESMF_INSTALL_PREFIX}"
echo "ESMF_ROOT: ${ESMF_ROOT}"
echo "MPI_ROOT: ${MPI_ROOT}"
echo "NetCDF C: $(nc-config --prefix)"
#echo "NetCDF Fortran: $(nf-config --prefix)"
echo ""
echo "Done sourcing ${BASH_SOURCE[0]}"

@lizziel
Contributor

lizziel commented Sep 2, 2020

I noticed in one of your outputs it lists this as your netcdf-fortran:
--version -> netCDF-Fortran 4.4.4

The user who had the same issue as you was actually using GEOS-Chem Classic. But he found this:

Update is that the simulation appears to have successfully finished using Lizzie's new environment file. So I guess the old environment file I used to use with GEOS-Chem classic no longer works with version 12.9.3.
My old environment file used netcdf-fortran/4.4.4-fasrc06 and yours uses netcdf-fortran/4.5.2-fasrc01

We definitely would love for you to try Spack. Another route, however, is to see if you can get a newer netcdf-fortran version, since at least one other person had an issue with 4.4.4 starting with GEOS-Chem 12.9.

@joeylamcy
Author

I see. If I switch to a newer netcdf-fortran version, do I also need to rebuild ESMF?

@WilliamDowns
Contributor

WilliamDowns commented Sep 5, 2020

EDIT: Sorry, I'm not actually sure if you need to rebuild ESMF specifically when changing NetCDF-Fortran libraries. The GCST will likely be away from this thread until Tuesday, so if you run into any more issues in the meantime, a rebuild of ESMF might help.

@LiamBindle
Contributor

LiamBindle commented Sep 5, 2020

@joeylamcy @WilliamDowns Yeah, you'll need to rebuild ESMF if you change NetCDF versions.

@joeylamcy
Author

Just want to post an update:
I am able to finish a trial run with proper output using Intel compilers 19.0.4, Intel MPI, netcdf-c 4.7.1, and netcdf-fortran 4.5.2. However, I have not succeeded with any multi-node runs. Do you have any tested configurations of core counts and memory usage? Or perhaps any tips on multi-node runs in general?

@lizziel
Contributor

lizziel commented Oct 13, 2020

Hi @joeylamcy, does this issue only happen when outputting to a lat-lon grid?

@joeylamcy
Author

joeylamcy commented Oct 14, 2020

Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes.
ncdump_lats_2nodes.txt
ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, which doesn't make sense, because these should be the latitudes of the same grid.

EDIT: Instead of yes, I actually mean NO. I am using the cubed-sphere grid and the lats array is still wrong.
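
For reference, a minimal way to compare the two lats arrays directly (a sketch only; the paths are placeholders for wherever the 1-node and 2-node outputs live):

import numpy as np
import xarray as xr

# Placeholder paths for the outputs of the 1-node and 2-node runs
ds1 = xr.open_dataset('run_1node/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4')
ds2 = xr.open_dataset('run_2nodes/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4')

# The latitude coordinates should be identical for the same c48 grid
diff = np.abs(ds1['lats'].values - ds2['lats'].values)
print('max |difference| in lats:', diff.max())
print('number of differing points:', int((diff > 1e-6).sum()))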

@lizziel
Contributor

lizziel commented Oct 14, 2020

Okay, I will see if I can reproduce given your lat/lon grid definition. Our standard testing currently does not include the lat/lon output option of MAPL so this very well may be a bug that went under the radar. I will report back when I have more information.

@joeylamcy
Author

joeylamcy commented Oct 14, 2020

Yes. I reverted the changes in HISTORY.rc (i.e. using default output grid in c48 simulations) and only turn on Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes.
ncdump_lats_2nodes.txt
ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, but it doesn't make sense, because these should be the latitudes of the same grid.

I'm sorry, I actually meant NO. The issue exists even when I used the default cubed-sphere grid. Sorry about misreading your question.

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

To clarify, it appears that even with CS output the lats coordinates have some bad sections (a bunch of near-zero values). Currently I'm seeing if I can reproduce the bad CS lats coordinates.

I'm trying

  • 1 node, 6 cores, 100 GB mem
  • 2 nodes, 12 cores, 100 GB mem per node.

I'll report back in a bit.

@lizziel
Contributor

lizziel commented Oct 14, 2020

Great, thanks @LiamBindle!

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

@joeylamcy Sorry you're running into this--thank you for your patience. I suspect there's a bug somewhere that's causing this problem.

I've tried a bunch of configurations, and unfortunately I haven't been able to reproduce the problem. I've tried

  • 1 node, 6 cores, 120 GB memory (Note: I had to increase to 120 GB because my sims crash with 100 GB)
  • 2 nodes, 6 cores, 100 GB memory per node
  • A bunch of combinations of grid/conservative collections settings on 1 nodes and 2 nodes

Can you try running GCHP with this HISTORY.rc? Can you try this with a 1 node, 2 node, and 4 node simulation? Could you share the output for these?

Additional question: are you still using 13.0.0-alpha.9?

@LiamBindle
Copy link
Contributor

LiamBindle commented Oct 14, 2020

It appears that the issue is with the coordinates, but the output data is okay. I downloaded GCHP.SpeciesConc_CS.20160701_0030z.nc4, which you shared above. Plotting it with its own coordinates looks bad, as you've reported.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds['lons'].isel(nf=nf).values
    y = ds['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

(Attached plot: SpeciesConc_O3 drawn with the file's own lats/lons; the result looks scrambled.)

However, if I plot SpeciesConc_O3 from your output but use lats and lons from one of my outputs, it looks okay:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds_good_coords = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds_good_coords['lons'].isel(nf=nf).values
    y = ds_good_coords['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

(Attached plot: the same data drawn with known-good lats/lons; the result looks correct.)

So it appears it's the lats and lons coordinates that are bad. If you could run the simulations I suggested above, that might help us narrow down the problem.

@joeylamcy
Author

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Additional question: are you still using 13.0.0-alpha.9?

Yes.

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

One last thing I should note. The lats and lons coordinates aren't yet well tested. GCPy, gcgridobj, and my own plotting scripts calculate grid-box coordinates externally. This is because if you want to plot CS data, you need grid-box corners, but they aren't included in the diagnostics yet (see #38). Corner coordinates will be in the diagnostics starting in 13.1, so from 13.1 onward you won't need to calculate grid-box corners in post-processing.

It looks like your 2-node simulation's diagnostics were okay, with the exception of the lats and lons coordinates. If you want to start using GCHP immediately, a temporary workaround would be to recalculate the coordinates after the simulation. This obviously would just be a stopgap; we definitely still need to figure out what's causing the bad coordinates. If you want to do this, let me know and I can follow up with some instructions.
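
As a rough illustration of that workaround (a sketch only, assuming you have any output on the same grid whose coordinates look correct, as in my plots above; the file names are placeholders):

import xarray as xr

# File whose lats/lons look wrong, and a file with known-good coordinates
bad = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4').load()
good = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')

# Overwrite the corrupted coordinate values and write a repaired copy
bad['lats'].values[...] = good['lats'].values
bad['lons'].values[...] = good['lons'].values
bad.to_netcdf('GCHP.SpeciesConc_CS.fixed.nc4')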

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Thanks, I'm looking forward to seeing the results.

@joeylamcy
Author

joeylamcy commented Oct 15, 2020

@LiamBindle
Contributor

@joeylamcy I can't seem to open the link. Could you review that and let me know when I can try again?

@joeylamcy
Author

@LiamBindle My apologies. I have edited the permission settings. Please try again.

@LiamBindle
Contributor

Thanks, I can see them now! I'll follow up by the end of my day.

@LiamBindle
Contributor

Thanks @joeylamcy. Yeah, this is interesting. It looks like something is going wrong with the lat and lon coordinates in the diagnostics (the rest of the diagnostics look okay). I suspect something subtle is happening in HISTORY, and so I've opened GEOS-ESM/MAPL#579.

@lizziel
Contributor

lizziel commented Oct 16, 2020

Hi @joeylamcy. Apologies if this has already been asked, but have you used your current environment successfully with older versions of GCHP?

@joeylamcy
Author

Yes. A very similar environment with minor changes works with GCHP 12.9.3. I'm able to obtain normal outputs of ozone concentration on a lat-lon grid.

@lizziel
Contributor

lizziel commented Oct 19, 2020

We have several versions of MAPL spread across the 13.0.0-alpha series. Would you be able to try GCHPctm 13.0.0-alpha.7? This uses an earlier version of MAPL than alpha.9 and would help determine if the problem came in with that update. Alpha versions that included updating to a new MAPL were 1, 3, 6, and 8.

@joeylamcy
Author

I have tried alpha.5 and alpha.7 thus far, but the issue exists for both versions.

@lizziel lizziel changed the title Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. [BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. Oct 21, 2020
@joeylamcy
Author

I have also tried alpha.1 today and the issue exists as well. I can't get the cmake part going for alpha.0. Do you think it is important to test on alpha.0?

@LiamBindle
Contributor

LiamBindle commented Nov 2, 2020

Hi Joey, thanks for checking that. I don't think it's important you check alpha.0. Knowing you see it in alpha.1 tells us this goes back a while, and that it hasn't been introduced recently. We've had some discussion about this internally and with the MAPL developers, and we're pretty stumped on what could be happening.

Since it appears the integrity of your diagnostics' variables is okay, I'd suggest you proceed with your GCHP simulations. Afterwards, recalculate the grid's coordinates wherever you need them (e.g., for plotting). The easiest way to do this is with GCPy. The latest GCPy (dev/1.0.0 branch) has a command-line tool for adding grid-box corners to an existing diagnostic file. For example, to add the corner coordinates to a diagnostic named GCHP.SpeciesConc.20180101z.nc4, you would do

$ python -m gcpy.append_grid_corners GCHP.SpeciesConc.20180101z.nc4

This adds the variables corner_lats and corner_lons to your dataset. You can use these new coordinates for plotting your data. Alternatively, if you use GCPy for plotting, it calculates the grid coordinates internally anyway, so you won't even need to do this step.

In GCHP 13.1.0, these corner coordinates are going to be included in the diagnostics automatically. You usually need these corner coordinates to plot GCHP data anyway (since it's a curvilinear grid, center coordinates aren't sufficient for plotting the data), which is why many of us (including GCPy) calculate the corner coordinates offline.
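
For example, the plotting loop from my earlier comment only needs to switch to the corner variables (a sketch, assuming corner_lats/corner_lons carry the same nf face dimension as lats/lons; the file name matches the append_grid_corners example above):

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc.20180101z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Corner arrays are one larger than the cell-center arrays in each horizontal
# dimension, so pcolormesh draws true cell edges on each cubed-sphere face
for nf in range(6):
    x = ds['corner_lons'].isel(nf=nf).values
    y = ds['corner_lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()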

I know this is a bit unsatisfactory, but I think this is the best way to proceed. Let me know if you have any questions.

@LiamBindle
Copy link
Contributor

@joeylamcy I'm going to close this. Please don't hesitate to open new issues if you run into problems/questions!

@zsx-GitHub

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

@lizziel Hi Lizzie and all, I recently ran into a similar issue. I forced the particulate matter concentration in GEOS-Chem to the measured value in the lowest 8 layers. After my revisions, the model runs smoothly if all HISTORY collections are turned off, but it stops with 'forrtl: error (73): floating divide by zero' if any of the HISTORY collections (e.g. SpeciesConc) is turned on. Does anyone have any ideas on how to fix this? Thank you!

Below I am pasting the output messages from slurm***.out:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
gcclassic 00000000011699CF Unknown Unknown Unknown
libpthread-2.17.s 00002ADA281F5630 Unknown Unknown Unknown
libnetcdf.so.7.2. 00002ADA27881D9C NC4_def_var Unknown Unknown
libnetcdf.so.7.2. 00002ADA27806FCB nc_def_var Unknown Unknown
libnetcdff.so.6.0 00002ADA2717A70C nf_def_var_ Unknown Unknown
gcclassic 0000000000FD5865 m_netcdf_io_defin 203 m_netcdf_io_define.F90
gcclassic 0000000001010914 ncdf_mod_mp_nc_va 3816 ncdf_mod.F90
gcclassic 000000000099AB0E history_netcdf_mo 465 history_netcdf_mod.F90
gcclassic 00000000009970C7 history_mod_mp_hi 2885 history_mod.F90
gcclassic 0000000000412E39 MAIN__ 2082 main.F90
gcclassic 000000000040C11E Unknown Unknown Unknown
libc-2.17.so 00002ADA28424555 __libc_start_main Unknown Unknown
gcclassic 000000000040C029 Unknown Unknown Unknown
/var/slurmd/spool/slurmd/job45642322/slurm_script: line 23: 94051 Aborted (core dumped) ./gcclassic - > $log

real 6m44.986s
user 104m8.109s
sys 1m46.121s

@lizziel
Contributor

lizziel commented Sep 27, 2021

@zsx-GitHub, please see new issue geoschem/geos-chem#917 which I created for your issue.

msulprizio added a commit that referenced this issue Oct 3, 2024
…ulation

- geos-chem: Merge PR #2492 fixes for GCHP carbon simulation
- MAPL: Merge PR #37 containing fix to vertically flip imports with dimensionless pressure proxy lev coordinates

Signed-off-by: Melissa Sulprizio <mpayer@seas.harvard.edu>