Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FVCOM_tools fails when using GNU #596

Closed
dmwright526 opened this issue Nov 5, 2021 · 21 comments
Closed

FVCOM_tools fails when using GNU #596

dmwright526 opened this issue Nov 5, 2021 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@dmwright526
Copy link
Collaborator

FVCOM_tools and the corresponding unit test creates a segfault when running the code using the GNU compilers. Code was originally tested and developed using Intel compilers, so the exact reason for the error is unclear. The segfault points to the fcst_nwp object and potentially one of the pointers within the object.

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeVandenberghe-NOAA I know you are very busy, but could we get a bit of your time? @dmwright526 is getting an odd seg fault when running with GNU and we are looking for some guidance. I will let him explain the problem.

@kgerheiser
Copy link
Contributor

I would recommend compiling and running with debug flags -g -fbacktrace -fcheck=bounds to see if it catches anything. And probably worth it to try with Intel too -g -traceback -check all.

@kgerheiser
Copy link
Contributor

kgerheiser commented Jan 14, 2022

Trying to build the tests, but the function call is missing some arguments:

  125 |  call fcst%read_n(trim(fvcomfile),' FVCOM',wcstart,lbclon,lbclat,lbctimes,t2,lbcmask,lbcsst,lbcice,lbcsfcT,lbciceT,fv3sfcTl)
      |                                                                                                                            1
Error: Missing actual argument for argument 'zorl' at (1)

And compared to the function defintion:

 read_nwp(this,filename,itype,wcstart,numlon,numlat,numtimes,time_to_get,mask,sst,ice,sfcT,iceT,sfcTl,zorl,hice)

Looks like it's missing 2 arguments: zorl and hice.

@dmwright526
Copy link
Collaborator Author

@kgerheiser Use this hash instead:
2a08859

This was before the recent update we had to push through quickly due to wrong initialization of surface roughness length causing the MYNN scheme to crash. It will take me a bit to update the tests and dummy data to include the fields, so this will at least allow to see if you can diagnose the problem.

@kgerheiser
Copy link
Contributor

Thanks, I was able to build and reproduce the problem. Intel debug flags didn't catch anything.

@GeorgeGayno-NOAA
Copy link
Collaborator

Thanks, I was able to build and reproduce the problem. Intel debug flags didn't catch anything.

@kgerheiser We know you are super busy. Any progress on this? Should we ask @arunchawla-NOAA for some of your time?

@edwardhartnett
Copy link
Collaborator

Seems like this is a problem in the FVCOM code and should be fixed by the programmers who submitted the code with the missing parameter. Or am I missing something...

@dmwright526
Copy link
Collaborator Author

@edwardhartnett This is legacy code that was developed for the HRRR to ingest the lake surface conditions. It was developed purely on NOAA machines with Intel compilers and porting it to GNU is an unexpected challenge without an obvious solution. The code was pushed to get this feature into RRFS development sooner for testing over the winter.

I am putting more time into it this week to try to find a solution.

@GeorgeVandenberghe-NOAA
Copy link
Contributor

Who is the original author(s) of the code. The problem in testing reported, was a binding mismatch the intel compiler let slide (for now). But it's fundamental. If the original author isn't available we will probably have to do a thorough analysis of the code for other possible errors Intel is letting slide and gnu is catching. I don't have a lot of time until after WCOSS2 goes live, after that I should be able to go after it.

@dmwright526
Copy link
Collaborator Author

Thanks @GeorgeVandenberghe-NOAA. Like As I mentioned above, this code is modified from HRRR for ingesting lake surface conditions which was developed on Intel compilers. While it is likely I added additional errors to the code, to my knowledge it was never ported to GNU.

@edwardhartnett
Copy link
Collaborator

OK, have you tried with the address santizer in GNU? Build with CFLAGS='-g -fsanitize=address -fno-omit-frame-pointer' and rerun the test. You will get much more info about why it fails.

@edwardhartnett
Copy link
Collaborator

Is there a PR up that demonstrates this problem?

Aren't the FVCOM tests run in CI? If so, how come CI is passing? If not, can we get them to run in CI? Then we could all see the failure, and also we already have address sanitizer running in CI...

@dmwright526
Copy link
Collaborator Author

The FVCOM unit tests are currently commented out in the CMakeList.txt file in the test directory as to not keep raising the known error with GNU. Adding this line back in will allow the unit test to be compiled and then run to show the error.

The address sanitizer flags, as I tested them, only specifies the fcst_nwp object (called named "fcst" in the code) as the culprit. I have tried digging deeper with both the flags mentioned in this thread and running through the GDB debugger, but have yet to be successful in isolating the problem.

@dmwright526
Copy link
Collaborator Author

@GeorgeVandenberghe-NOAA @edwardhartnett @kgerheiser Do any of you have availability to take a look at this bug some more? I have been attempting to debug it over the past few days using the unit test (tests/fvcom_tools).

When compiling with both Intel and GNU, the .nc files are able to be read in correctly in the sorc/fvcom_tools/module_ncio.f90 (routines used for directly reading in netCDF data) and sorc/fvcom_tools/module_nwp.f90 (routines used to populate specific surface variables). The unit test will segfault with GNU when attempting to use variables (fv3sst, fv3ice, fv3iceT, etc.) passed out of sorc/fvcom_tools/module_nwp.f90 (subroutine read_nwp) to tests/fvcom_tools/ftst_readfvcomnetcdf.F90.

@edwardhartnett
Copy link
Collaborator

Can you please rebuild with both C and Fortran FLAGS including -fsantize=address?

@dmwright526
Copy link
Collaborator Author

@edwardhartnett thank you for the suggestion! I tried with this flag on, but it did not change the segfault output or give any more information on the problem within the object.

@edwardhartnett
Copy link
Collaborator

Is there a test that demonstrates this problem?

@dmwright526
Copy link
Collaborator Author

@edwardhartnett The unit test currently in the repository will produce the error (make test)

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Mar 3, 2023
and to run its unit test on Hera.

Part of ufs-community#596.
@GeorgeGayno-NOAA
Copy link
Collaborator

GeorgeGayno-NOAA commented Mar 3, 2023

To reproduce the problem on Hera, use my branch at 9ea1e2d. I get the following error:

Test project /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/build
    Start 1: fvcom_tools-ftst_readfvcomnetcdf
1/1 Test #1: fvcom_tools-ftst_readfvcomnetcdf ...***Exception: SegFault  0.06 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.07 sec

The following tests FAILED:
          1 - fvcom_tools-ftst_readfvcomnetcdf (SEGFAULT)

The segfault occurs at "type(fcst_nwp)" declaration line of the test:

 data lbcvice_expected /1, -99.99999/
 data t2_expected /2 / !expect second time index from fvcom file

 type(ncio) :: geo !grid data object
 type(fcst_nwp) :: fcst !object to read data into

 character(len=180) :: fv3file !fv3 file name

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Mar 3, 2023
@GeorgeGayno-NOAA
Copy link
Collaborator

Commenting out some of the procedure statements (0de2503) eliminates the 'seg fault'. The test fails later, as expected.

+ ctest
Test project /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/build
    Start 1: fvcom_tools-ftst_readfvcomnetcdf
1/1 Test #1: fvcom_tools-ftst_readfvcomnetcdf ...***Failed    0.09 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.10 sec

The following tests FAILED:
          1 - fvcom_tools-ftst_readfvcomnetcdf (Failed)

@GeorgeGayno-NOAA
Copy link
Collaborator

Adding a new - but simple - subroutine to the class (0de2503) results in a seg fault:

+ ctest
Test project /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/build
    Start 1: fvcom_tools-ftst_readfvcomnetcdf
1/1 Test #1: fvcom_tools-ftst_readfvcomnetcdf ...***Exception: SegFault  0.06 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.06 sec

The following tests FAILED:
          1 - fvcom_tools-ftst_readfvcomnetcdf (SEGFAULT)

So, there is something about those class procedures/subroutines that GNU does not like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants