Add service "health" parameter to HK (and provide a suggested pattern for apps to follow) #1469

skliper · 2021-04-30T16:04:52Z

Is your feature request related to a problem? Please describe.
Historically, syslog or events have been used to report issues, and telemetry status reporting is likely scattered and/or inconsistent. It is not easy to be sure at a glance that everything is "healthy". One example is system startup synchronization: there isn't an easy way to tell (especially with spotty comm) that startup synchronization was successful. There are also other cases where operation continues "best effort" in failure conditions, since nothing can really be done from within the system.

Describe the solution you'd like
Add an app/service health summary parameter to HK, where 0 is healthy and nonzero bits indicate specific issues that have been encountered. Latch on condition, but clear with a reset command. Proper synchronization is an easy first condition to add, but scrub for others to include in the summary. This addition reduces the dependency on syslog/events for a monitoring system (like HS or an "external" monitor) or the ground to take appropriate action.
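As a rough illustration of the latch-and-clear behavior described above, a minimal sketch in C might look like the following. All names here (`Health_State_t`, `Health_Latch`, the bit definitions) are hypothetical and are not existing cFE APIs; the bit assignments are placeholders only.

```c
#include <stdint.h>

/* Hypothetical health bits; names and bit assignments are illustrative only. */
#define HEALTH_OK                0x00000000u
#define HEALTH_STARTUP_SYNC_FAIL 0x00000001u
#define HEALTH_CDS_RECOVERY_FAIL 0x00000002u

/* One summary word per app/service, intended to be reported in its HK packet. */
typedef struct
{
    uint32_t Summary; /* 0 = healthy, nonzero = at least one latched issue */
} Health_State_t;

/* Latch a condition: once set, the bit stays set until an explicit clear. */
void Health_Latch(Health_State_t *State, uint32_t Bits)
{
    State->Summary |= Bits;
}

/* Intended to be called from a (hypothetical) reset command handler. */
void Health_Clear(Health_State_t *State)
{
    State->Summary = HEALTH_OK;
}

/* Value to copy into the HK telemetry payload. */
uint32_t Health_GetSummary(const Health_State_t *State)
{
    return State->Summary;
}
```

With this shape, a failure path that currently only writes to the syslog could also call `Health_Latch()` and then continue, so the condition stays visible in HK until commanded clear.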

Additionally, many of the CDS "errors" are simply written to the system log (or not) and initialization continues. When these things fail, something is wrong or something got corrupted, and that needs to be more obvious (examples):

cFE/modules/tbl/fsw/src/cfe_tbl_internal.c

Lines 155 to 163 in 84ba9a9

    
    if (Status != CFE_SUCCESS)
    {
        /* Note if we were unable to recover error free Critical Table Registry from the CDS */
        CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to recover Critical Table Registry (Err=0x%08X)\n",
                             (unsigned int)Status);
    }
    /* Whether we recovered the Critical Table Registry or not, we are successful with initialization */
    Status = CFE_SUCCESS;

cFE/modules/tbl/fsw/src/cfe_tbl_internal.c

Lines 167 to 173 in 84ba9a9

    
    /* Not being able to support Critical Tables is not the end of the world */
    /* Note the problem and move on */
    CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to create Critical Table Registry (Err=0x%08X)\n",
                         (unsigned int)Status);
    /* Failure to support critical tables is not a good enough reason to exit the cFE on start up */
    Status = CFE_SUCCESS;

cFE/modules/tbl/fsw/src/cfe_tbl_internal.c

Lines 177 to 188 in 84ba9a9

    
    /* Save the initial version of the Critical Table Registry in the CDS */
    Status = CFE_ES_CopyToCDS(CFE_TBL_Global.CritRegHandle, CFE_TBL_Global.CritReg);
    if (Status != CFE_SUCCESS)
    {
        /* Not being able to support Critical Tables is not the end of the world */
        /* Note the problem and move on */
        CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to save Critical Table Registry (Err=0x%08X)\n",
                             (unsigned int)Status);
        /* Failure to support critical tables is not a good enough reason to exit the cFE on start up */
        Status = CFE_SUCCESS;

Describe alternatives you've considered
None

Additional context
#1466 would allow apps to add the sync status; note also that #1467 would provide the syslog. Spawned from issues discussed at code review.

Requester Info
Jacob Hageman - NASA/GSFC

skliper · 2021-04-30T16:08:07Z

The concept here isn't to duplicate what's provided in events/syslog, but to report "real" health issues at a higher level: basically, things that mean the system really isn't behaving or configured correctly. This allows for continued ops (if the monitor doesn't trigger a reset) when needed to perform recovery options, while also providing situational awareness.
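To make the monitor side of this concrete, here is a small sketch of how an external monitor might interpret a health summary word. It assumes a hypothetical per-mission mask of bits that warrant a reset; the type and function names are invented for illustration and do not correspond to HS or any existing cFS app.

```c
#include <stdint.h>

/* Hypothetical actions a monitor could take on a health summary. */
typedef enum
{
    MONITOR_ACTION_NONE,   /* summary is 0: healthy */
    MONITOR_ACTION_NOTIFY, /* unhealthy, but ops may continue (situational awareness) */
    MONITOR_ACTION_RESET   /* at least one reset-worthy bit is latched */
} Monitor_Action_t;

/*
 * Decide on a response from the summary word alone, without parsing
 * syslog or events. ResetMask selects which bits are considered
 * serious enough to trigger a reset; this policy is an assumption.
 */
Monitor_Action_t Monitor_Evaluate(uint32_t HealthSummary, uint32_t ResetMask)
{
    if (HealthSummary == 0)
    {
        return MONITOR_ACTION_NONE;
    }

    if ((HealthSummary & ResetMask) != 0)
    {
        return MONITOR_ACTION_RESET;
    }

    return MONITOR_ACTION_NOTIFY;
}
```

Keeping the reset policy in a mask outside the reporting app means the same health bits can support both "continue and recover" and "reset immediately" missions.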

skliper · 2021-08-26T21:51:54Z

Another case of syslog-only reporting of an unhealthy system:

cFE/modules/tbl/fsw/src/cfe_tbl_internal.c

Lines 750 to 759 in 5e41330

    
    /* Take Mutex to make sure we are not trying to grab a working buffer that some */
    /* other application is also trying to grab. */
    OsStatus = OS_MutSemTake(CFE_TBL_Global.WorkBufMutex);
    /* Make note of any errors but continue and hope for the best */
    if (OsStatus != OS_SUCCESS)
    {
        CFE_ES_WriteToSysLog("%s: Internal error taking WorkBuf Mutex (Status=%ld)\n", __func__,
                             (long)OsStatus);
    }

skliper · 2021-09-01T19:41:14Z

There are also numerous cases where conditions that should never happen are tested for, with inconsistent responses. One idea from recent discussions is to add an API with a "soft exception" sort of concept, where we capture context and have a configurable response (and a persistent reporting mechanism if applicable). Something like "system tainted" or similar. This could reduce event clutter and provide more consistent reporting.

One example is the ID match failure case here, where a cleanup action will happen and a message will get reported, but it's not really obvious that something might be seriously broken (similar to cases where mutex actions fail):

cFE/modules/es/fsw/src/cfe_es_apps.c

Line 1150 in e5d4ed9

if (CFE_ES_AppRecordIsMatch(AppRecPtr, AppId))
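The "soft exception" idea could be sketched roughly as below. This is only one possible shape under stated assumptions: every name (`SoftExc_State_t`, `SoftExc_Report`, `SOFTEXC_REPORT`, the response enum) is hypothetical, and the choice to keep only the first occurrence's context plus a saturating count is an assumption, not something settled in the discussion.

```c
#include <stdint.h>

/* Hypothetical configurable responses to a "should never happen" condition. */
typedef enum
{
    SOFTEXC_RESPONSE_IGNORE,       /* record context only */
    SOFTEXC_RESPONSE_TAINT,        /* mark the system as tainted */
    SOFTEXC_RESPONSE_RESET_REQUEST /* taint and request a reset */
} SoftExc_Response_t;

/* Context captured at the point of the report. */
typedef struct
{
    const char *File;
    int         Line;
    int32_t     Status;
} SoftExc_Context_t;

typedef struct
{
    SoftExc_Response_t Response;       /* configured response */
    uint8_t            Tainted;        /* persistent "system tainted" flag */
    uint8_t            ResetRequested; /* for an external monitor to act on */
    uint32_t           Count;          /* saturating count of reports */
    SoftExc_Context_t  First;          /* context of the first occurrence */
} SoftExc_State_t;

void SoftExc_Report(SoftExc_State_t *State, const char *File, int Line, int32_t Status)
{
    /* Keep the first occurrence, which is often the most useful for diagnosis. */
    if (State->Count == 0)
    {
        State->First.File   = File;
        State->First.Line   = Line;
        State->First.Status = Status;
    }

    if (State->Count < UINT32_MAX)
    {
        State->Count++;
    }

    switch (State->Response)
    {
        case SOFTEXC_RESPONSE_RESET_REQUEST:
            State->ResetRequested = 1;
            State->Tainted        = 1;
            break;
        case SOFTEXC_RESPONSE_TAINT:
            State->Tainted = 1;
            break;
        case SOFTEXC_RESPONSE_IGNORE:
        default:
            break;
    }
}

/* Convenience macro so call sites capture file and line automatically. */
#define SOFTEXC_REPORT(StatePtr, Status) \
    SoftExc_Report((StatePtr), __FILE__, __LINE__, (int32_t)(Status))
```

A call site such as the ID mismatch branch or a failed mutex take would then call `SOFTEXC_REPORT(&State, Status)` alongside its existing cleanup, giving one consistent place to see that something unexpected occurred.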

skliper · 2024-07-22T17:29:53Z

I've got a health reporting library prototyped that manages status bits and a counter, which works nicely for HK reporting. If anyone ever wants to help advance this, collaborate, or just get a status update, let me know. The APIs are still maturing, but I'm using it in a real-world use case and the concept seems helpful, especially for automation/autonomy.

skliper added the enhancement label Apr 30, 2021

skliper mentioned this issue Aug 26, 2021

TBL uncovered lines in CFE_TBL_LoadCmd, no alternative error codes from CFE_TBL_GetWorkingBuffer #1901

Closed

skliper mentioned this issue Jan 3, 2023

Fix #1985, Check return value of CFE_ES_PutPoolBuf #2235

Open

2 tasks

skliper mentioned this issue Mar 3, 2023

CFE_SB_GetBufferFromPool discarding CFE_ES_GetPoolBuf error status #2251

Open

skliper mentioned this issue May 24, 2023

fix nasa#2316 - adding CFE_TIME_StringFmt() and CFE_TIME_StringFmtLen… #2345

Closed

2 tasks
