Driver dies with a seg-fault rather than a graceful abort if DRV_RESTART_POINTER file does not exist #524

Open
ekluzek opened this issue Dec 19, 2024 · 1 comment
Labels
bug Something isn't working enhancement New feature or request

Comments

@ekluzek
Collaborator

ekluzek commented Dec 19, 2024

If the file pointed to by DRV_RESTART_POINTER does not exist, the driver dies with a segmentation fault instead of exiting gracefully with a message saying that the file was not found.

This occurs in what will be ctsm5.3.016, with cime6.1.49 and cmeps1.0.32.

The full description is here:

ESCOMP/CTSM#2914

The tests that fail are:

ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly
ERS_P128x1_Ld765.f10_f10_mg37.I2000Clm60Fates.derecho_intel.clm-FatesColdNoComp

For the first of these tests, only the cesm.log file is generated:

cesm.log:

cat /glade/derecho/scratch/erik/tests_ctsm5316acl/ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly.GC.ctsm5316acl_int/run/case2run/cesm.log.7269007.desched1.241218-154730
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf) Read in prof_inparm namelist from: drv_in
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf) Using profile_disable=          F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_timer=                      4
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_depth_limit=               12
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_detail_limit=               2
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_barrier=          F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_num=                  1
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_stride=               0
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_single_file=      F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_global_stats=     T
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_ovhd_measurement= F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_add_detail=       F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_papi_enable=      F
dec2343.hsn.de.hpc.ucar.edu 0:  ESMF_Finalize: Error closing trace stream
dec2343.hsn.de.hpc.ucar.edu 0: MPICH ERROR [Rank 0] [job id 2dd16cc6-e949-427e-bb59-48726c16f9fa] [Wed Dec 18 15:47:41 2024] [dec2343] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0
dec2343.hsn.de.hpc.ucar.edu 0: 
dec2343.hsn.de.hpc.ucar.edu 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec2343.hsn.de.hpc.ucar.edu 0: Image              PC                Routine            Line        Source             
dec2343.hsn.de.hpc.ucar.edu 0: libpthread-2.31.s  000015004133C8C0  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F2FBE7E  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F10A22F  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003D7376A8  MPI_Abort             Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049332277  _ZN5ESMCI3VMK5abo     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049330814  _ZN5ESMCI2VM5abor     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500493476E5  c_esmc_vmabort_       Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049B5C7A8  esmf_vmmod_mp_esm     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500499CC1EE  esmf_initmod_mp_e     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           0000000000433ADA  MAIN__                    132  esmApp.F90
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           00000000004230FD  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libc-2.31.so       000015003C7E129D  __libc_start_main     Unknown  Unknown

drv.log:

cat /glade/derecho/scratch/erik/tests_ctsm5316acl/ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly.GC.ctsm5316acl_int/run/case2run/drv.log.7269007.desched1.241218-154730
  read rpointer file = rpointer.cpl.2001-01-18-00000

@ekluzek ekluzek added bug Something isn't working enhancement New feature or request labels Dec 19, 2024
@ekluzek
Collaborator Author

ekluzek commented Dec 19, 2024

Looking at the code, there is error handling for this as follows:

cesm/driver/esm_time_mod.F90:

          call NUOPC_CompAttributeGet(instance_driver, name='drv_restart_pointer', value=restart_pfile, rc=rc)
          if (ChkErr(rc,__LINE__,u_FILE_u)) return

          if (trim(restart_pfile) /= 'none') then

             if (maintask) then
                write(logunit,*) " read rpointer file = "//trim(restart_pfile)
                inquire( file=trim(restart_pfile), exist=exists)
                if (.not. exists) then
                   rc = ESMF_FAILURE
                   call ESMF_LogWrite(trim(subname)//' ERROR rpointer file '//trim(restart_pfile)//' not found', &
                        ESMF_LOGMSG_ERROR, line=__LINE__, file=__FILE__)
                   return
                endif

So the error message is written to the ESMF PET log files, but no PET files were created for this case.
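While this is being diagnosed, the two conditions discussed here (whether the pointer file exists, and whether any PET log files were written) can be checked by hand from the shell. The following is only a minimal sketch: `check_rpointer` is a hypothetical helper, not part of CIME or CMEPS, the run directory and pointer file name are passed in manually, and the `PET*.ESMF_LogFile` pattern assumes ESMF's default log file naming.

```shell
#!/bin/sh
# Hypothetical helper (not part of CIME/CMEPS):
#   check_rpointer RUNDIR POINTER_FILE
# Returns 0 if RUNDIR/POINTER_FILE exists, 1 otherwise, and reports how
# many ESMF PET log files are present in RUNDIR.
check_rpointer() {
    rundir=$1
    pfile=$2
    rc=0

    if [ -f "$rundir/$pfile" ]; then
        echo "found restart pointer file: $rundir/$pfile"
    else
        echo "ERROR: restart pointer file not found: $rundir/$pfile" >&2
        rc=1
    fi

    # Assumption: ESMF's default per-PET log names, PET<n>.ESMF_LogFile.
    # If nothing matches, the glob stays literal and the -e test skips it.
    n=0
    for f in "$rundir"/PET*.ESMF_LogFile; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "ESMF PET log files present: $n"

    return "$rc"
}
```

For the failing case above, this could be called as `check_rpointer "$RUNDIR" rpointer.cpl.2001-01-18-00000`, where `$RUNDIR` stands for the case2run directory and the pointer file name is the one reported in drv.log.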
