global_enkf ctest seg faults in debug mode #776

Open
RussTreadon-NOAA opened this issue Jul 28, 2024 · 9 comments

@RussTreadon-NOAA
Contributor

enkf.x built in debug mode from develop at 3e27bb8 aborts with the following error on WCOSS2 (Cactus):

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
enkf.x             0000000004F1793B  Unknown               Unknown  Unknown
libpthread-2.31.s  0000153993EA08C0  Unknown               Unknown  Unknown
enkf.x             000000000053EA05  letkf_mp_letkf_up         279  letkf.f90
libiomp5.so        00001539952E93F3  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        000015399526D937  __kmp_fork_call       Unknown  Unknown
libiomp5.so        0000153995231533  __kmpc_fork_call      Unknown  Unknown
enkf.x             0000000000538471  letkf_mp_letkf_up         279  letkf.f90

Line 279 of letkf.f90 is the start of a large OpenMP parallel do region:

! Loop for each horizontal grid points on this task.
!$omp parallel do schedule(dynamic) default(none) private(npt,nob,nobsl, &
!$omp                  nobsl2,ngrd1,corrlength,ens_tmp,coslat, &
!$omp                  nf,vdist,obens,indxassim,indxob,maxdfs, &
!$omp                  nn,hxens,wts_ensmean,dfs,rdiag,dep,rloc,i, &
!$omp                  oindex,deglat,dist,corrsq,nb,nlev,nanal,sresults, &

It's not clear exactly what the error is.

NCO builds and runs codes in debug mode as part of their pre-implementation testing. As such, the debug-mode enkf.x failure must be examined, understood, and resolved. This issue is opened to document the error and work toward a resolution.

Attention @CatherineThomas-NOAA

@RussTreadon-NOAA
Contributor Author

Sensitivity test
Comment out the omp directives on lines 279 to 296 and the omp directive on line 527. Recompile in debug mode. Run the global_enkf ctest. The debug enkf.x ran to completion.

We should carefully examine the omp directives for grdloop to ensure all directives are correct. We should also check all the variables in grdloop to ensure all variables that need to be declared omp private are declared as such.
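
One way to start the audit suggested above is to pull the names out of the private(...) clause so they can be compared against the variables assigned inside grdloop. The following is only a minimal Python sketch under stated assumptions: the helper name private_names and the assigned_in_loop set (including the made-up name tmp_work) are hypothetical, and the directive text is just the five lines quoted at the top of this issue, which are truncated there, so the resulting list is incomplete.

```python
import re


def private_names(directive_lines):
    """Return the names listed in private(...) clauses of an OpenMP directive."""
    joined = ""
    for line in directive_lines:
        text = line.strip()
        # Drop the "!$omp" sentinel and any trailing continuation "&".
        text = re.sub(r"^!\$omp\b", "", text, flags=re.IGNORECASE).strip()
        text = text.rstrip("&").strip()
        joined += " " + text
    names = []
    # \b keeps this from matching firstprivate/lastprivate. A clause that is
    # truncated (no closing parenthesis) is read to the end of the text.
    for clause in re.finditer(r"\bprivate\s*\(([^)]*)", joined, flags=re.IGNORECASE):
        names += [n.strip() for n in clause.group(1).split(",") if n.strip()]
    return names


# The directive lines quoted in this issue (the quote stops mid-clause).
directive = [
    "!$omp parallel do schedule(dynamic) default(none) private(npt,nob,nobsl, &",
    "!$omp                  nobsl2,ngrd1,corrlength,ens_tmp,coslat, &",
    "!$omp                  nf,vdist,obens,indxassim,indxob,maxdfs, &",
    "!$omp                  nn,hxens,wts_ensmean,dfs,rdiag,dep,rloc,i, &",
    "!$omp                  oindex,deglat,dist,corrsq,nb,nlev,nanal,sresults, &",
]
declared = private_names(directive)

# Hypothetical: names a reviewer found assigned inside the loop body.
# "tmp_work" is invented for illustration and is not taken from letkf.f90.
assigned_in_loop = {"npt", "nob", "dist", "tmp_work"}
not_private = sorted(assigned_in_loop - set(declared))

print(len(declared), "names in the quoted part of the private clause")
print("assigned in loop but not in private clause:", not_private)
```

A name that shows up in the second list is not necessarily a bug: it may be intentionally shared, or covered by a firstprivate or reduction clause on a line not quoted here. The list is only a starting point for reading the full directive on lines 279 to 296.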

Another interesting data point would be to re-enable the commented-out omp directives, build enkf.x in debug mode using the WCOSS2 intel-classic compiler, and run the global_enkf ctest to see whether this enkf.x aborts or runs to completion.

@RussTreadon-NOAA RussTreadon-NOAA mentioned this issue Jul 30, 2024
7 tasks
@RussTreadon-NOAA
Contributor Author

@CatherineThomas-NOAA, when is the GFS v17 code handoff to NCO? We need to ensure enkf.x does not seg fault when run in debug mode. This is an NCO IT test.

@CatherineThomas-NOAA
Collaborator

@RussTreadon-NOAA: Code handoff is tentatively in late summer/fall 2025. We still have time but this will definitely need to be addressed before then.

@RussTreadon-NOAA
Contributor Author

@CatherineThomas-NOAA , wow, that's 9 to 12 months out. I didn't realize the GFS v17 schedule had been adjusted that much. It's good to have more time. One concern is that things get forgotten and slip through the cracks. When is the EMC GFS v17 science freeze?

@CatherineThomas-NOAA
Collaborator

@RussTreadon-NOAA: The science freeze is tentatively planned for this winter. Of course these are all subject to change, particularly as we learn more about the moratorium schedule/impacts.

@RussTreadon-NOAA
Contributor Author

Sensitivity test - compiler

The comment above suggested building the debug enkf.x using Intel 2021.6.0.20220226 instead of Intel 19.1.3.20200925. This test has been completed. enkf.x still aborts in the same manner:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
enkf.x             0000000004EB215B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014AA6F0028C0  Unknown               Unknown  Unknown
enkf.x             000000000053D635  letkf_mp_letkf_up         279  letkf.f90
libiomp5.so        000014AA7047A893  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        000014AA703EE429  __kmp_fork_call       Unknown  Unknown
libiomp5.so        000014AA703AA425  __kmpc_fork_call      Unknown  Unknown
enkf.x             0000000000536E3A  letkf_mp_letkf_up         279  letkf.f90
enkf.x             0000000000414908  MAIN__                    210  enkf_main.f90
enkf.x             0000000000413292  Unknown               Unknown  Unknown
libc-2.31.so       000014AA6ECDC24D  __libc_start_main     Unknown  Unknown
enkf.x             00000000004131AA  Unknown               Unknown  Unknown

@RussTreadon-NOAA
Contributor Author

Sensitivity test - number of threads

The failures reported above ran the debug enkf.x with one thread. Increasing the thread count to 2 does not alter behavior. The debug enkf.x still aborts in the same manner.

@jswhit2
Contributor

jswhit2 commented Sep 20, 2024

I was able to run on Hercules in debug mode with Intel 2021.9.0 (not the ctest; I just replaced the executable with a debug version in a running experiment).

@RussTreadon-NOAA
Contributor Author

Thank you @jswhit2 for running this test and letting us know. It is entirely possible that we are dealing with something specific to WCOSS2.
