[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. #37
I suspect that the error in AdvCore_GridCompMod is misleading, but that is something we should fix. In … I found that setting …
Oh, right. I checked the log for the debug run again and actually the run ended way earlier than without the debug flag, so I suppose using …
It should help - you'll just need to fix the … EDIT: By "fix the …" …
Hi @joeylamcy, the original error you reported is occurring in MAPL History during the diagnostic write. Are you able to make it go away by turning off diagnostics? This might help home in on the problem.
Regarding the debug flags issue, I created an issue on GEOS-ESM/FVdycoreCubed_GridComp: GEOS-ESM/FVdycoreCubed_GridComp#71.
Yes. If all collections in HISTORY.rc are commented out, the run continues smoothly. But turning on any number of the collections seems to cause the problem, i.e. it is not specific to one of the collections.
Actually I tried it, but there are some further issues. The printout is still stuck at …
and the error log grows at a rate of ~100 MB/min for at least 5 minutes, so I just manually stopped the run. The leading error is still in …
The array temporary warnings are irrelevant - given enough time, the code should still reach the actual error - but I agree that it's not really helpful to have them padding the error log. They also slow the run down considerably, so although the printout appears stuck, it should eventually clear. @LiamBindle - can you recommend a preferred way to suppress array temporary warnings in FV3 using CMake? I can imagine that one could do this by editing the contents of …

@joeylamcy - that all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is... unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).
I noticed you are running a c48 standard simulation with 6 cores and 3 GB per core across 1 node, if the log file prints are to be trusted. It surprises me that the simulation ran without running out of memory. You can try upping your resources and lowering your resolution to c24 to see if that makes a difference at all for the diagnostics. Also try commenting out individual collections to see if there is a specific history collection consistently causing the problem. Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.
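For reference, one quick way to see what frequency and duration a collection actually produced is to inspect the time axis of an output file. A minimal sketch, assuming xarray is available; the filename is hypothetical:

```python
# Minimal sketch: inspect the time axis of a History output file.
# The filename is hypothetical - substitute one of your own output files.
import xarray as xr

ds = xr.open_dataset('GCHP.SpeciesConc.20160701_0030z.nc4')
print('time slices per file:', ds.sizes['time'])   # expect 24 for hourly frequency, daily duration
print('first timestamps:', ds['time'].values[:3])  # should be spaced 1 hour apart
```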
I believe those temporary array warnings can be suppressed with … Unfortunately, as you suspected @sdeastham, I think manually adding … Let me know if you run into any problems suppressing those temporary array warnings, @joeylamcy!
I am going to put this update into the GCHPctm 13.0.0-alpha.10 pre-release.
@lizziel I think you can do …
Following up about the original issue, we have another report of a similar divide by zero floating point error while writing diagnostics: …
This was using an older version of GCHPctm. It was fixed by switching to a different set of libraries, including netcdf. Try homing in on @sdeastham's suggestion: …
I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?
Yes.
Side note: during …, there are HDF errors, and the sizes are obviously not right either. Meanwhile, ncdump-ing the MERRA-2 data is fine.
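A quick way to confirm whether a suspect file is readable end-to-end is to force a full read of every variable. A minimal sketch; the filename is hypothetical, so substitute the file that produces the HDF errors:

```python
# Minimal sketch: force a full read of every variable in a NetCDF-4/HDF5 file.
# A corrupt file typically raises an HDF error partway through the loop.
# The filename is hypothetical - substitute the file that ncdump chokes on.
import xarray as xr

ds = xr.open_dataset('GEOSChem.Restart.20160701_0000z.c48.nc4')
for var in ds.data_vars.values():
    _ = var.values  # triggers the actual disk read
print(f'Read {len(ds.data_vars)} variables cleanly')
```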
I changed duration to 1-hr and run start/end/duration to 1-hr as well, but the same floating divide by zero error occurs.
I didn't try other alpha versions. But we used these libraries when building 12.8.2 of the old GCHP.
Yep, it works and only a few warnings are left before the error messages. Now I get the same errors as @lizziel did in GEOS-ESM/ESMA_cmake#125 (comment).
I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.
I'm very suspicious about the issue with HDF5 during cmake; @LiamBindle, any thoughts?
I'm getting those errors (from GetPointer.H and MAPL_Generic.F90) after the fix though.
Did you also move the conditional for N <= ntracers? That solved it for me. Regardless, I now get past advection and am getting a new error in History. This is a problem in the GMAO MAPL library. I think it is safe to say that using debug flags in GCHP is not yet ready. I am working with GMAO to get fixes for the bugs I am finding into their code.

I agree with @sdeastham that the focus for your issue should be on the netcdf/HDF5 library. Could you post your environment file, CMakeCache.txt, CMakeFiles/CMakeError.log, and CMakeFiles/CMakeOutput.log?

We also now have documentation on how to build libraries for GCHPctm with Spack, and we are looking for beta users to try it out. Are you interested in trying it? It may solve the issue.
I'm a bit surprised it didn't pick up HDF5 automatically considering it picked up NetCDF automatically, but @joeylamcy did the correct thing in pointing CMake to the appropriate HDF5 library with …

The fact it's crashing in a …
gchp_13.0.0.env.txt

netcdf-4.6.1 is sourced upon login. The shell script is as follows: …
It looks promising. Does Spack need root access?
Spack does not require root access. Those instructions should be fine for getting set up with OpenMPI and GNU compilers; Intel MPI and/or Intel compilers also work but require a bit more setup that we haven't written out yet on the wiki. You also won't need to manually define as many environment variables when loading NetCDF through Spack or other package managers. I've pasted a working environment file below (change SPACK_ROOT and ESMF_DIR as needed): …
I noticed in one of your outputs it lists this as your netcdf-fortran: …

The user who had the same issue as you was actually using GEOS-Chem Classic. But he found this: …
We definitely would love for you to try Spack. Another route, however, is to see if you can get a newer netcdf-fortran version, since at least one other person had an issue with 4.4.4 starting with GEOS-Chem 12.9.
I see there is a newer netcdf-fortran version. Do I also need to rebuild ESMF?
EDIT: Sorry, I'm not actually sure if you need to rebuild ESMF specifically when changing NetCDF-Fortran libraries. The GCST will likely be away from this thread until Tuesday, so if you run into any more issues, a rebuild of ESMF might help.
@joeylamcy @WilliamDowns Yeah, you'll need to rebuild ESMF if you change NetCDF versions.
Just want to post an update: …
Hi @joeylamcy, does this issue only happen when outputting to a lat-lon grid?
Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on … You can see that lines 2222-3661 are different, but it doesn't make sense, because these should be the latitudes of the same grid.

EDIT: Instead of yes, I actually mean NO. I am using the cubed-sphere grid and the lats array is still wrong.
Okay, I will see if I can reproduce given your lat/lon grid definition. Our standard testing currently does not include the lat/lon output option of MAPL, so this very well may be a bug that went under the radar. I will report back when I have more information.
I'm sorry. I actually meant NO. The issue exists even when I use the default cubed-sphere grid. Sorry about misreading your question.
To clarify, it appears that even with CS output the … I'm trying …

I'll report back in a bit.
Great, thanks @LiamBindle!
@joeylamcy Sorry you're running into this--thank you for your patience. I suspect there's a bug somewhere that's causing this problem. I've tried a bunch of configurations, and unfortunately I haven't been able to reproduce it. I've tried …

Can you try running GCHP with this HISTORY.rc? Can you try this with a 1-node, 2-node, and 4-node simulation? Could you share the output for these? Additional question: are you still using 13.0.0-alpha.9?
It appears the issue is with the coordinates, but the output data is okay. I downloaded …

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs  # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds['lons'].isel(nf=nf).values
    y = ds['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)
plt.show()
```

However, if I plot SpeciesConc_O3 from your output, but I use …

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs  # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds_good_coords = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds_good_coords['lons'].isel(nf=nf).values
    y = ds_good_coords['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)
plt.show()
```

So it appears it's the …
@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.
Yes.
One last thing I should note: … It looks like your 2-node simulation's diagnostics were okay, with the exception of the …
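If it helps, a quick way to quantify the coordinate corruption is to diff the lats between the bad collection and a known-good one. A sketch, reusing the filenames from the plotting scripts above and assuming `lats` has dimensions (nf, Ydim, Xdim):

```python
# Sketch: compare the coordinate variables of a corrupt collection against a
# known-good one. Assumes 'lats' has dimensions (nf, Ydim, Xdim).
import numpy as np
import xarray as xr

good = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
bad = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')

dlat = np.abs(good['lats'].values - bad['lats'].values)
print('max abs lat difference:', dlat.max())
print('cube faces with differences:', np.unique(np.nonzero(dlat)[0]))
```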
Thanks, I'm looking forward to seeing the results.
You can check the results at: https://mycuhk-my.sharepoint.com/:f:/g/personal/1155064480_link_cuhk_edu_hk/EpwESaXqXDlKuesfj6mhQ0wB0JgVhfh0EB1LSUd5Re_AJQ?e=0KU6Vg (EDIT: link edited.)
@joeylamcy I can't seem to open the link. Could you review that and let me know when I can try again?
@LiamBindle My apologies. I have edited the permission settings. Please try again.
Thanks, I can see them now! I'll follow up by the end of my day.
Thanks @joeylamcy. Yeah, this is interesting. It looks like something is going wrong with the …
Hi @joeylamcy. Apologies if this has already been asked, but have you used your current environment successfully with older versions of GCHP?
Yes. A very similar environment with minor changes was successful with GCHP v12.9.3. I'm able to obtain normal outputs of ozone concentration on a lat-lon grid.
We have several versions of MAPL spread across the 13.0.0-alpha series. Would you be able to try GCHPctm 13.0.0-alpha.7? This uses an earlier version of MAPL than alpha.9 and would help determine if the problem came in with that update. Alpha versions that included updating to a new MAPL were 1, 3, 6, and 8.
I have tried alpha.5 and alpha.7 thus far, but the issue exists in both versions.
I have also tried alpha.1 today and the issue exists there as well. I can't get the cmake step working for alpha.0. Do you think it is important to test alpha.0?
Hi Joey, thanks for checking that. I don't think it's important you check alpha.0. Knowing you see it in alpha.1 tells us this goes back a while, and that it hasn't been introduced recently. We've had some discussion about this internally and with the MAPL developers, and we're pretty stumped on what could be happening.

Since it appears the integrity of your diagnostics' variables is okay, I'd suggest you proceed with your GCHP simulations. Afterwards, recalculate the grid's coordinates wherever you need them (e.g., for plotting). The easiest way to do this is with GCPy. The latest GCPy (…) can do it like so:

```
$ python -m gcpy.append_grid_corners GCHP.SpeciesConc.20180101z.nc4
```

This adds the variables … In GCHP 13.1.0, these corner coordinates are going to be included in the diagnostics automatically. You actually usually need these corner coordinates to plot GCHP data anyway (since it's a curvilinear grid, center coordinates aren't sufficient for plotting the data), which is why many of us (including GCPy) calculate these corner coordinates offline.

I know this is a bit unsatisfactory, but I think this is the best way to proceed. Let me know if you have any questions.
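Once the corners are appended, plotting follows the same pattern as the scripts above. A sketch - the corner variable names `corner_lons`/`corner_lats` are assumptions, so check what `append_grid_corners` actually wrote to your file:

```python
# Sketch: plot using corner coordinates appended by gcpy.append_grid_corners.
# 'corner_lons'/'corner_lats' are assumed names - verify them with ncdump.
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import xarray as xr

ds = xr.open_dataset('GCHP.SpeciesConc.20180101z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()

ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

for nf in range(6):
    x = ds['corner_lons'].isel(nf=nf).values  # (N+1, N+1) cell corners
    y = ds['corner_lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values                 # (N, N) cell values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree())
plt.show()
```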
@joeylamcy I'm going to close this. Please don't hesitate to open new issues if you run into problems/questions!
@lizziel Hi Lizzie and all, I recently ran into a similar issue. I enforced particulate matter concentrations in GEOS-Chem to be at the measured values in the lowest 8 layers. After my revisions, the model can run smoothly if all HISTORY collections are turned off. The model will stop with 'forrtl: error (73): floating divide by zero' if any of the HISTORY collections (e.g. SpeciesConc) is turned on. Does anyone have any ideas on how to fix this? Thank you! Below I am pasting the output messages in slurm***.out:

```
forrtl: error (73): floating divide by zero
real 6m44.986s
```
@zsx-GitHub, please see new issue geoschem/geos-chem#917, which I created for your issue.
Hi everyone,
I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour and exited with `forrtl: error (73): floating divide by zero`. The full log files are attached below:

163214_print_out.log
163214_error.log

More information: ESMF_COMM=intelmpi

I'm not sure how to troubleshoot this issue. I tried to cmake the source code with `-DCMAKE_BUILD_TYPE=Debug` (with the fix in #35) and rerun the simulation, but it gives a really large error log file so I'm not attaching it here. The first few lines of the error log are: …

I also noticed something weird towards the start of the run: … Previous versions (12.8.2) usually show this instead: … but I'm not sure if that matters.