
[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. #37

Closed
joeylamcy opened this issue Aug 29, 2020 · 61 comments


@joeylamcy

joeylamcy commented Aug 29, 2020

Hi everyone,

I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour with forrtl: error (73): floating divide by zero. The full log files are attached below.
163214_print_out.log
163214_error.log

More information:

  • Intel MPI with the Intel 18 compiler
  • ESMF 8.0.0 public release built with ESMF_COMM=intelmpi

I'm not sure how to troubleshoot this issue. I tried building with cmake -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerunning the simulation, but it produces a really large error log file, so I'm not attaching it here. The first few lines of the error log are:

forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               00000000094F8D62  Unknown               Unknown  Unknown
geos               00000000094F6232  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos               0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos               0000000007F00B39  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               000000000844804D  Unknown               Unknown  Unknown
geos               0000000007EE2A0F  Unknown               Unknown  Unknown
geos               0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos               0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos               0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos               0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos               0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos               00000000004242FF  MAIN__                     29  GCHPctm.F90
geos               000000000042125E  Unknown               Unknown  Unknown
geos               000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBC9F34505  __libc_start_main     Unknown  Unknown
geos               0000000000421169  Unknown               Unknown  Unknown

I also noticed something weird towards the start of the run:

      MAPL: No configure file specified for logging layer.  Using defaults. 
     SHMEM: NumCores per Node = 6
     SHMEM: NumNodes in use   = 1
     SHMEM: Total PEs         = 6
     SHMEM: NumNodes in use  = 1

Previous versions (e.g. 12.8.2) usually show this instead:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6


 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1

but I'm not sure if that matters.

@sdeastham
Contributor

I suspect that the error in AdvCore_GridCompMod is misleading, but it is something we should fix. In AdvCore_GridCompMod.F90, ntracers is set to 11 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86). However, this leads to a formatting error, because a later loop over N = 1, ntracers writes N-1 into a string using a single-digit integer format, which overflows once N-1 reaches 10 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L260-L269). The fix for that particular error is obvious - just set ntracers to 10 (ntracers doesn't seem to be a particularly important variable, and is only used to define these "test outputs").

I found that setting ntracers=10 does fix this error and allows you to find whatever the REAL error is. @lizziel we should raise this with GMAO and kick a pull request up the chain!

@joeylamcy
Author

Oh, right. I checked the log for the debug run again, and the run actually ended much earlier than without the debug flag, so I suppose using -DCMAKE_BUILD_TYPE=Debug doesn't help me here.

@sdeastham
Contributor

sdeastham commented Aug 31, 2020

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue", I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.

@lizziel
Contributor

lizziel commented Aug 31, 2020

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

@lizziel
Contributor

lizziel commented Aug 31, 2020

Regarding the debug flags issue, I created an issue on GEOS-ESM/FVdycoreCubed_GridComp: GEOS-ESM/FVdycoreCubed_GridComp#71.

@joeylamcy
Author

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

Yes. If all collections in HISTORY.rc are commented out, the run continues smoothly. But turning on any number of the collections causes the problem, i.e. it's not specific to any one collection.

@joeylamcy
Author

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue", I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.

Actually I tried it, but there are some further issues. The printout is still stuck at

NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
 ncnst=           0  num_prog=           0  pnats=           0  dnats=
           0  num_family=           0
 
 Grid distance at face edge (km)=   163384.217664128     

and the error log grows at a rate of ~100 MB/min for at least 5 minutes, so I just manually stopped the run. The leading error is still in AdvCore_GridCompMod.F90.

forrtl: warning (406): fort: (1): In call to MPI_GROUP_INCL, an array temporary was created for argument #3

Image              PC                Routine            Line        Source
geos.debug         00000000094A5440  Unknown               Unknown  Unknown
geos.debug         0000000005A6D950  mpp_mod_mp_get_pe         109  mpp_util_mpi.inc
geos.debug         0000000005A9EE8F  mpp_mod_mp_mpp_in          55  mpp_comm_mpi.inc
geos.debug         0000000004A0B3A6  fms_mod_mp_fms_in         342  fms.F90
geos.debug         000000000226E3A3  advcore_gridcompm         311  AdvCore_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos.debug         0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos.debug         0000000007F00B39  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         000000000844804D  Unknown               Unknown  Unknown
geos.debug         0000000007EE2A0F  Unknown               Unknown  Unknown
geos.debug         0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos.debug         0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos.debug         0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos.debug         0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos.debug         0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos.debug         00000000004242FF  MAIN__                     29  GCHPctm.F90
geos.debug         000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002B9C8AD6A505  __libc_start_main     Unknown  Unknown
geos.debug         0000000000421169  Unknown               Unknown  Unknown

@sdeastham
Contributor

The array temporary warnings are irrelevant - given enough time, the code should still reach the actual error - but I agree that it's not helpful to have them padding the error log. They also slow the run down considerably, so although the printout appears stuck, it should eventually clear.

@LiamBindle - can you recommend a preferred way to suppress array temporary warnings in FV3 using CMake? I can imagine that one could do this by editing the contents of ESMA_cmake, but that seems non-ideal.

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

  1. Can you verify that your copy of MAPL_HistoryGridComp.F90 also has call o_Clients%done_collective_stage() on line 3570? That will give us a thread to tug on with GMAO.
  2. Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.
  3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4)?

@lizziel
Contributor

lizziel commented Sep 1, 2020

I noticed you are running a c48 standard simulation with 6 cores and 3 GB per core across 1 node, if the log file prints are to be trusted. It surprises me that the simulation ran without running out of memory. You can try upping your resources and lowering your resolution to c24 to see if that makes a difference at all for the diagnostics.

Also try commenting out individual collections to see if there is a specific history collection consistently causing the problem.

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 hours (or duration to 1 hour and your run start/end/duration to 1 hour as well) in runConfig.sh and see if that changes anything.

@LiamBindle
Contributor

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this more cleanly, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warnings, @joeylamcy!

@lizziel
Contributor

lizziel commented Sep 1, 2020

I am going to put this update into the GCHPctm 13.0.0-alpha.10 pre-release.

@LiamBindle
Contributor

@lizziel I think you can do "SHELL:-check noarg_temp_created" to get it to work for ifort 18 and 19, if ifort 19 doesn't like the comma.

@lizziel
Contributor

lizziel commented Sep 1, 2020

Following up about the original issue, we have another report of a similar divide by zero floating point error while writing diagnostics:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
geos               0000000001FBCA6F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AD70A4335D0  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE6A14  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE4B4B  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.0  00002AD706A10B5B  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00002AD706524DB4  nf_def_var_           Unknown  Unknown
geos               0000000001B0E765  m_netcdf_io_defin         218  m_netcdf_io_define.F90
geos               0000000001B62855  ncdf_mod_mp_nc_va        3866  ncdf_mod.F90
geos               000000000187B19E  history_netcdf_mo         465  history_netcdf_mod.F90
geos               0000000001876EB6  history_mod_mp_hi        2925  history_mod.F90
geos               0000000000412C17  MAIN__                   2076  main.F90
geos               000000000040C4DE  Unknown               Unknown  Unknown
libc-2.17.so       00002AD70A8663D5  __libc_start_main     Unknown  Unknown
geos               000000000040C3E9  Unknown               Unknown  Unknown

This was using an older version of GCHPctm. It was fixed by switching to a different set of libraries, including netcdf. Try honing in on @sdeastham's suggestion:

Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

@joeylamcy
Author

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

1. Can you verify that your copy of `MAPL_HistoryGridComp.F90` also has `call o_Clients%done_collective_stage()` on line 3570? That will give us a thread to tug on with GMAO.

Yes.

2. Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ifort --version
ifort (IFORT) 18.0.2 20180210
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nc-config --all

This netCDF 4.6.1 has been built with the following features: 

  --cc        -> icc
  --cflags    -> -I/opt/share/netcdf-4.6.1/include 
  --libs      -> -L/opt/share/netcdf-4.6.1/lib -lnetcdf

  --has-c++   -> no
  --cxx       -> 

  --has-c++4  -> no
  --cxx4      -> 

  --has-fortran-> yes
  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf
  --has-f90   -> no
  --has-f03   -> yes

  --has-dap   -> yes
  --has-dap4  -> yes
  --has-nc2   -> yes
  --has-nc4   -> yes
  --has-hdf5  -> yes
  --has-hdf4  -> no
  --has-logging-> no
  --has-pnetcdf-> no
  --has-szlib -> no
  --has-parallel -> no
  --has-cdf5 -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --libdir    -> /opt/share/netcdf-4.6.1/lib
  --version   -> netCDF 4.6.1

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nf-config --all

This netCDF-Fortran 4.4.4 has been built with the following features: 

  --cc        -> icc
  --cflags    ->  -I/opt/share/netcdf-4.6.1/include 

  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf 
  --has-f90   -> no
  --has-f03   -> yes

  --has-nc2   -> yes
  --has-nc4   -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --version   -> netCDF-Fortran 4.4.4

Side note: during cmake .., there is an error saying that HDF5 is missing, so I manually exported CMAKE_PREFIX_PATH=/opt/share/hdf5-1.10.2. If this matters, let me know and I can try reproducing it.

3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run `ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4`)?

HDF errors, and the file sizes are obviously not right either. Meanwhile, ncdump-ing the MERRA-2 input data works fine.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.DryDep.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.DryDep.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ls -lh OutputDir/
total 12K
-rw-r--r--. 1 s1155064480 AmosTai 23 Aug 27 20:09 FILLER
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.DryDep.20160701_0030z.nc4
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.SpeciesConc.20160701_0030z.nc4

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.

I changed duration to 1-hr and run start/end/duration to 1-hr as well, but the same floating divide by zero error occurs.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

I didn't try other alpha versions, but we used these libraries when building GCHP 12.8.2.

@joeylamcy
Author

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this cleaner, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warning @joeylamcy!

Yep, it works and only a few warnings are left before the error messages. Now I get the same errors as @lizziel did in GEOS-ESM/ESMA_cmake#125 (comment)

@lizziel
Contributor

lizziel commented Sep 2, 2020

I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.

@sdeastham
Contributor

I'm very suspicious about the issue with HDF5 during cmake; @LiamBindle, any thoughts?

@joeylamcy
Author

I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.

I'm still getting those errors (from GetPointer.H and MAPL_Generic.F90) even after the fix, though.

@lizziel
Contributor

lizziel commented Sep 2, 2020

Did you also move the conditional for N <= ntracers? That solved it for me. Regardless, I now get past advection and am getting a new error in History. This is a problem in the GMAO MAPL library. I think it is safe to say that running GCHP with debug flags is not yet ready. I am working with GMAO to get fixes for the bugs I am finding into their code.

I agree with @sdeastham that the focus for your issue should be on the netcdf/HDF5 library. Could you post your environment file, CMakeCache.txt, CMakeFiles/CMakeError.log, and CMakeFiles/CMakeOutput.log?

We also now have documentation on how to build the libraries for GCHPctm with Spack. We are looking for beta users to try it out; are you interested? It may solve the issue.

@LiamBindle
Contributor

I'm a bit surprised it didn't pick up HDF5 automatically considering it picked up NetCDF automatically, but @joeylamcy did the correct thing in pointing CMake to the appropriate HDF5 library with CMAKE_PREFIX_PATH.

The fact that it's crashing in a nc_def_var call (deep in HISTORY) after writing 96 bytes suggests, to me, that it's something obscure to do with NetCDF. The fact that the simulation runs okay when output collections are turned off supports that too. It looks like the checkpoint file is being written okay, so it isn't consistent. I would agree with the suggestions to

  1. Try a different version of NetCDF/NetCDF-Fortran
  2. Increase the resources (I'm surprised 18 GB is enough for C48)

@joeylamcy
Author

gchp_13.0.0.env.txt
CMakeCache.txt
CMakeError.log
CMakeOutput.log

netcdf-4.6.1 is sourced upon login. The shell script is as follows:

export NETCDF=/opt/share/netcdf-4.6.1
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include/:$INCLUDE

@joeylamcy
Author

We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.

It looks promising. Does Spack need root access?

@WilliamDowns
Contributor

Spack does not require root access. Those instructions should be fine for getting set up with OpenMPI and GNU compilers; Intel MPI and/or Intel compilers also work but require a bit more setup that we haven't written out yet on the Wiki. You also won't need to manually define as many environment variables when loading NetCDF through Spack or other package managers. I've pasted a working environment file below (change SPACK_ROOT and ESMF_DIR as needed):

spack unload
export SPACK_ROOT=/path/to/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load emacs
#==============================================================================
# %%%%% Load Spackages %%%%%
#==============================================================================
spack load gcc@9.3.0
spack load git%gcc@9.3.0
spack load cmake%gcc@9.3.0
spack load openmpi%gcc@9.3.0
spack load netcdf-fortran%gcc@9.3.0^openmpi

export MPI_ROOT=$(spack location -i openmpi)

# Make all files world-readable by default
umask 022

# Specify compilers
export CC=gcc
export CXX=g++
export FC=gfortran

# For ESMF
export ESMF_COMPILER=gfortran
export ESMF_COMM=openmpi
export ESMF_DIR=/path/to/ESMF
export ESMF_INSTALL_PREFIX=${ESMF_DIR}/INSTALL_openmpi_gfortran93
# For GCHP
export ESMF_ROOT=${ESMF_INSTALL_PREFIX}

#==============================================================================
# Set limits
#==============================================================================

#ulimit -c 0                      # coredumpsize
export OMP_STACKSIZE=500m
ulimit -l unlimited              # memorylocked
ulimit -u 50000                  # maxproc
ulimit -v unlimited              # vmemoryuse
ulimit -s unlimited              # stacksize

#==============================================================================
# Print information
#==============================================================================

#module list
echo ""
echo "Environment:"
echo ""
echo "CC: ${CC}"
echo "CXX: ${CXX}"
echo "FC: ${FC}"
echo "ESMF_COMM: ${ESMF_COMM}"
echo "ESMF_COMPILER: ${ESMF_COMPILER}"
echo "ESMF_DIR: ${ESMF_DIR}"
echo "ESMF_INSTALL_PREFIX: ${ESMF_INSTALL_PREFIX}"
echo "ESMF_ROOT: ${ESMF_ROOT}"
echo "MPI_ROOT: ${MPI_ROOT}"
echo "NetCDF C: $(nc-config --prefix)"
#echo "NetCDF Fortran: $(nf-config --prefix)"
echo ""
echo "Done sourcing ${BASH_SOURCE[0]}"

@lizziel
Contributor

lizziel commented Sep 2, 2020

I noticed in one of your outputs it lists this as your netcdf-fortran:
--version -> netCDF-Fortran 4.4.4

The user who had the same issue as you was actually using GEOS-Chem Classic. But he found this:

Update is that the simulation appears to have successfully finished using Lizzie's new environment file. So I guess the old environment file I used to use with GEOS-Chem classic no longer works with version 12.9.3.
My old environment file used netcdf-fortran/4.4.4-fasrc06 and yours uses netcdf-fortran/4.5.2-fasrc01

We definitely would love for you to try Spack. Another route, however, is to see if you can get a newer netcdf-fortran version, since at least one other person had an issue with 4.4.4 starting with GEOS-Chem 12.9.

@joeylamcy
Author

I see. If I switch to a newer netcdf-fortran version, do I also need to rebuild ESMF?

@WilliamDowns
Contributor

WilliamDowns commented Sep 5, 2020

EDIT: Sorry, I'm not actually sure if you need to rebuild ESMF specifically when changing NetCDF-Fortran libraries. The GCST will likely be away from this thread until Tuesday, so if you run into any more issues in the meantime, a rebuild of ESMF might help.

@LiamBindle
Contributor

LiamBindle commented Sep 5, 2020

@joeylamcy @WilliamDowns Yeah, you'll need to rebuild ESMF if you change NetCDF versions.

@joeylamcy
Author

Just want to post an update:
I am able to finish a trial run with proper output using Intel compilers 19.0.4, Intel MPI, netcdf-c 4.7.1, and netcdf-fortran 4.5.2. However, I have not succeeded with any multi-node runs. Do you have any tested configurations of core counts and memory usage? Or perhaps any tips on multi-node runs in general?

@lizziel
Contributor

lizziel commented Oct 13, 2020

Hi @joeylamcy, does this issue only happen when outputting to a lat-lon grid?

@joeylamcy
Author

joeylamcy commented Oct 14, 2020

Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes.
ncdump_lats_2nodes.txt
ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, which doesn't make sense, because these should be the latitudes of the same grid.

EDIT: Instead of yes, I actually mean NO. I am using the cubed-sphere grid and the lats array is still wrong.
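
For reference, a minimal way to compare the two lats arrays directly (a sketch only; the paths are placeholders for wherever the 1-node and 2-node outputs live):

import numpy as np
import xarray as xr

# Placeholder paths for the outputs of the 1-node and 2-node runs
ds1 = xr.open_dataset('run_1node/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4')
ds2 = xr.open_dataset('run_2nodes/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4')

# The latitude coordinates should be identical for the same c48 grid
diff = np.abs(ds1['lats'].values - ds2['lats'].values)
print('max |difference| in lats:', diff.max())
print('number of differing points:', int((diff > 1e-6).sum()))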

@lizziel
Contributor

lizziel commented Oct 14, 2020

Okay, I will see if I can reproduce given your lat/lon grid definition. Our standard testing currently does not include the lat/lon output option of MAPL so this very well may be a bug that went under the radar. I will report back when I have more information.

@joeylamcy
Author

joeylamcy commented Oct 14, 2020

Yes. I reverted the changes in HISTORY.rc (i.e. using default output grid in c48 simulations) and only turn on Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes.
ncdump_lats_2nodes.txt
ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, but it doesn't make sense, because these should be the latitudes of the same grid.

I'm sorry, I actually meant NO. The issue exists even when I used the default cubed-sphere grid. Sorry about misreading your question.

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

To clarify, it appears that even with CS output the lats coordinates have some bad sections (a bunch of near-zero values). Currently I'm seeing if I can reproduce the bad CS lats coordinates.

I'm trying

  • 1 node, 6 cores, 100 GB mem
  • 2 nodes, 12 cores, 100 GB mem per node.

I'll report back in a bit.

@lizziel
Contributor

lizziel commented Oct 14, 2020

Great, thanks @LiamBindle!

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

@joeylamcy Sorry you're running into this--thank you for your patience. I suspect there's a bug somewhere that's causing this problem.

I've tried a bunch of configurations, and unfortunately I haven't been able to reproduce the problem. I've tried

  • 1 node, 6 cores, 120 GB memory (Note: I had to increase to 120 GB because my sims crash with 100 GB)
  • 2 nodes, 6 cores, 100 GB memory per node
  • A bunch of combinations of grid/conservative collections settings on 1 nodes and 2 nodes

Can you try running GCHP with this HISTORY.rc? Can you try this with a 1 node, 2 node, and 4 node simulation? Could you share the output for these?

Additional question: are you still using 13.0.0-alpha.9?

@LiamBindle
Copy link
Contributor

LiamBindle commented Oct 14, 2020

It appears that the issue is with the coordinates, but the output data is okay. I downloaded GCHP.SpeciesConc_CS.20160701_0030z.nc4, which you shared above. Plotting it with its own coordinates looks bad, as you've reported.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds['lons'].isel(nf=nf).values
    y = ds['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

(Attached plot: SpeciesConc_O3 drawn with the file's own lats/lons; the result looks scrambled.)

However, if I plot SpeciesConc_O3 from your output but use lats and lons from one of my outputs, it looks okay:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds_good_coords = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds_good_coords['lons'].isel(nf=nf).values
    y = ds_good_coords['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

(Attached plot: the same data drawn with known-good lats/lons; the result looks correct.)

So it appears it's the lats and lons coordinates that are bad. If you could run the simulations I suggested above, that might help us narrow down the problem.

@joeylamcy
Author

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Additional question: are you still using 13.0.0-alpha.9?

Yes.

@LiamBindle
Contributor

LiamBindle commented Oct 14, 2020

One last thing I should note. The lats and lons coordinates aren't yet well tested. GCPy, gcgridobj, and my own plotting scripts calculate grid-box coordinates externally. This is because if you want to plot CS data, you need grid-box corners, but they aren't included in the diagnostics yet (see #38). Corner coordinates will be in the diagnostics starting in 13.1, so from 13.1 onward you won't need to calculate grid-box corners in post-processing.

It looks like your 2-node simulation's diagnostics were okay, with the exception of the lats and lons coordinates. If you want to start using GCHP immediately, a temporary workaround would be to recalculate the coordinates after the simulation. This obviously would just be a stopgap; we definitely still need to figure out what's causing the bad coordinates. If you want to do this, let me know and I can follow up with some instructions.
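
As a rough illustration of that workaround (a sketch only, assuming you have any output on the same grid whose coordinates look correct, as in my plots above; the file names are placeholders):

import xarray as xr

# File whose lats/lons look wrong, and a file with known-good coordinates
bad = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4').load()
good = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')

# Overwrite the corrupted coordinate values and write a repaired copy
bad['lats'].values[...] = good['lats'].values
bad['lons'].values[...] = good['lons'].values
bad.to_netcdf('GCHP.SpeciesConc_CS.fixed.nc4')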

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Thanks, I'm looking forward to seeing the results.

@joeylamcy
Author

joeylamcy commented Oct 15, 2020

@LiamBindle
Contributor

@joeylamcy I can't seem to open the link. Could you review that and let me know when I can try again?

@joeylamcy
Author

@LiamBindle My apologies. I have edited the permission settings. Please try again.

@LiamBindle
Contributor

Thanks, I can see them now! I'll follow up by the end of my day.

@LiamBindle
Contributor

Thanks @joeylamcy. Yeah, this is interesting. It looks like something is going wrong with the lat and lon coordinates in the diagnostics (the rest of the diagnostics look okay). I suspect something subtle is happening in HISTORY, and so I've opened GEOS-ESM/MAPL#579.

@lizziel
Contributor

lizziel commented Oct 16, 2020

Hi @joeylamcy. Apologies if this has already been asked, but have you used your current environment successfully with older versions of GCHP?

@joeylamcy
Author

Yes. A very similar environment with minor changes works with GCHP 12.9.3. I'm able to obtain normal outputs of ozone concentration on a lat-lon grid.

@lizziel
Contributor

lizziel commented Oct 19, 2020

We have several versions of MAPL spread across the 13.0.0-alpha series. Would you be able to try GCHPctm 13.0.0-alpha.7? This uses an earlier version of MAPL than alpha.9 and would help determine if the problem came in with that update. Alpha versions that included updating to a new MAPL were 1, 3, 6, and 8.

@joeylamcy
Author

I have tried alpha.5 and alpha.7 thus far, but the issue exists for both versions.

@lizziel lizziel changed the title Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. [BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. Oct 21, 2020
@joeylamcy
Author

I have also tried alpha.1 today and the issue exists as well. I can't get the cmake part going for alpha.0. Do you think it is important to test on alpha.0?

@LiamBindle
Contributor

LiamBindle commented Nov 2, 2020

Hi Joey, thanks for checking that. I don't think it's important you check alpha.0. Knowing you see it in alpha.1 tells us this goes back a while, and that it hasn't been introduced recently. We've had some discussion about this internally and with the MAPL developers, and we're pretty stumped on what could be happening.

Since it appears the integrity of your diagnostics' variables is okay, I'd suggest you proceed with your GCHP simulations. Afterwards, recalculate the grid's coordinates wherever you need them (e.g., for plotting). The easiest way to do this is with GCPy. The latest GCPy (dev/1.0.0 branch) has a command-line tool for adding grid-box corners to an existing diagnostic file. For example, to add the corner coordinates to a diagnostic named GCHP.SpeciesConc.20180101z.nc4, you would do

$ python -m gcpy.append_grid_corners GCHP.SpeciesConc.20180101z.nc4

This adds the variables corner_lats and corner_lons to your dataset. You can use these new coordinates for plotting your data. Alternatively, if you use GCPy for plotting, it calculates the grid coordinates internally anyway, so you won't even need to do this step.

In GCHP 13.1.0, these corner coordinates are going to be included in the diagnostics automatically. You usually need these corner coordinates to plot GCHP data anyway (since it's a curvilinear grid, center coordinates aren't sufficient for plotting the data), which is why many of us (including GCPy) calculate the corner coordinates offline.
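
For example, the plotting loop from my earlier comment only needs to switch to the corner variables (a sketch, assuming corner_lats/corner_lons carry the same nf face dimension as lats/lons; the file name matches the append_grid_corners example above):

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc.20180101z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Corner arrays are one larger than the cell-center arrays in each horizontal
# dimension, so pcolormesh draws true cell edges on each cubed-sphere face
for nf in range(6):
    x = ds['corner_lons'].isel(nf=nf).values
    y = ds['corner_lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()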

I know this is a bit unsatisfactory, but I think this is the best way to proceed. Let me know if you have any questions.

@LiamBindle
Copy link
Contributor

@joeylamcy I'm going to close this. Please don't hesitate to open new issues if you run into problems/questions!

@zsx-GitHub

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

@lizziel Hi Lizzie and all, I recently ran into a similar issue. I forced the particulate matter concentration in GEOS-Chem to the measured value in the lowest 8 layers. After my revisions, the model runs smoothly if all HISTORY collections are turned off, but it stops with 'forrtl: error (73): floating divide by zero' if any of the HISTORY collections (e.g. SpeciesConc) is turned on. Does anyone have any ideas on how to fix this? Thank you!

Below I am pasting the output messages from slurm***.out:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
gcclassic 00000000011699CF Unknown Unknown Unknown
libpthread-2.17.s 00002ADA281F5630 Unknown Unknown Unknown
libnetcdf.so.7.2. 00002ADA27881D9C NC4_def_var Unknown Unknown
libnetcdf.so.7.2. 00002ADA27806FCB nc_def_var Unknown Unknown
libnetcdff.so.6.0 00002ADA2717A70C nf_def_var_ Unknown Unknown
gcclassic 0000000000FD5865 m_netcdf_io_defin 203 m_netcdf_io_define.F90
gcclassic 0000000001010914 ncdf_mod_mp_nc_va 3816 ncdf_mod.F90
gcclassic 000000000099AB0E history_netcdf_mo 465 history_netcdf_mod.F90
gcclassic 00000000009970C7 history_mod_mp_hi 2885 history_mod.F90
gcclassic 0000000000412E39 MAIN__ 2082 main.F90
gcclassic 000000000040C11E Unknown Unknown Unknown
libc-2.17.so 00002ADA28424555 __libc_start_main Unknown Unknown
gcclassic 000000000040C029 Unknown Unknown Unknown
/var/slurmd/spool/slurmd/job45642322/slurm_script: line 23: 94051 Aborted (core dumped) ./gcclassic - > $log

real 6m44.986s
user 104m8.109s
sys 1m46.121s

@lizziel
Contributor

lizziel commented Sep 27, 2021

@zsx-GitHub, please see new issue geoschem/geos-chem#917 which I created for your issue.

msulprizio added a commit that referenced this issue Oct 3, 2024
…ulation

- geos-chem: Merge PR #2492 fixes for GCHP carbon simulation
- MAPL: Merge PR #37 containing fix to vertically flip imports with dimensionless pressure proxy lev coordinates

Signed-off-by: Melissa Sulprizio <mpayer@seas.harvard.edu>