Add service "health" parameter to HK (and provide a suggested pattern for apps to follow) #1469

Open
skliper opened this issue Apr 30, 2021 · 4 comments

@skliper
Contributor

skliper commented Apr 30, 2021

Is your feature request related to a problem? Please describe.
Historically, syslog or events are used to report issues, and telemetry status reporting is likely scattered and/or inconsistent. It's not easy to be sure everything is "healthy" at a glance. One example is system startup synchronization: there isn't an easy way to tell (especially with spotty comm) that startup synchronization was successful. There are also other cases where operation continues "best effort" in failure conditions, since nothing can really be done from within the system.

Describe the solution you'd like
Add an app/service health summary parameter to HK, where 0 is healthy and nonzero bits indicate specific issues that have been encountered. Latch on condition, but clear with a reset command. Proper synchronization is an easy first condition to add, but scrub for others to include in the summary. This addition reduces the dependency on syslog/events for a monitoring system (like HS or an "external" monitor) or the ground to take appropriate action.
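
A minimal sketch of the latch-and-reset idea, assuming a single 32-bit summary word per app. The type, function, and bit names here are all hypothetical and are not part of any existing cFE API:

```c
#include <stdint.h>

/* Hypothetical condition bits; names and bit assignments are illustrative only. */
#define HEALTH_BIT_STARTUP_SYNC_FAILED (1u << 0)
#define HEALTH_BIT_CDS_RECOVERY_FAILED (1u << 1)

/* Hypothetical per-app health word that would be copied into HK telemetry. */
typedef struct
{
    uint32_t Summary; /* 0 = healthy, each nonzero bit = a latched condition */
} Health_Status_t;

/* Latch a condition: once set, the bit stays set until an explicit reset. */
static void Health_Latch(Health_Status_t *Health, uint32_t ConditionBit)
{
    Health->Summary |= ConditionBit;
}

/* Body of a hypothetical "reset health" command handler: clears all latched bits. */
static void Health_Reset(Health_Status_t *Health)
{
    Health->Summary = 0;
}

/* A monitor (or the ground) only needs to compare the summary against zero. */
static int Health_IsHealthy(const Health_Status_t *Health)
{
    return Health->Summary == 0;
}
```

With something like this, a failed startup sync would call Health_Latch() once, and the bit would stay visible in every HK packet until commanded clear, even if the original syslog entry or event was missed.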

Additionally, many of the CDS "errors" are simply written to the system log (or not) and initialization continues. When these things fail, something is wrong or got corrupted, and that needs to be more obvious. Examples:

if (Status != CFE_SUCCESS)
{
    /* Note if we were unable to recover error free Critical Table Registry from the CDS */
    CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to recover Critical Table Registry (Err=0x%08X)\n",
                         (unsigned int)Status);
}
/* Whether we recovered the Critical Table Registry or not, we are successful with initialization */
Status = CFE_SUCCESS;

/* Not being able to support Critical Tables is not the end of the world */
/* Note the problem and move on */
CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to create Critical Table Registry (Err=0x%08X)\n",
                     (unsigned int)Status);
/* Failure to support critical tables is not a good enough reason to exit the cFE on start up */
Status = CFE_SUCCESS;

/* Save the initial version of the Critical Table Registry in the CDS */
Status = CFE_ES_CopyToCDS(CFE_TBL_Global.CritRegHandle, CFE_TBL_Global.CritReg);
if (Status != CFE_SUCCESS)
{
    /* Not being able to support Critical Tables is not the end of the world */
    /* Note the problem and move on */
    CFE_ES_WriteToSysLog("CFE_TBL:EarlyInit-Failed to save Critical Table Registry (Err=0x%08X)\n",
                         (unsigned int)Status);
    /* Failure to support critical tables is not a good enough reason to exit the cFE on start up */
    Status = CFE_SUCCESS;
}

Describe alternatives you've considered
None

Additional context
#1466 would allow apps to add the sync status; note also that #1467 would provide the syslog. Spawned from issues discussed at code review.

Requester Info
Jacob Hageman - NASA/GSFC

@skliper
Contributor Author

skliper commented Apr 30, 2021

The concept here isn't to duplicate what's provided in events/syslog, but to report "real" health issues at a higher level: basically, things that mean the system really isn't behaving or configured correctly. This allows for continued ops (if the monitor doesn't trigger a reset) when required to perform recovery options, while also providing situational awareness.

@skliper
Contributor Author

skliper commented Aug 26, 2021

Another case of syslog-only reporting of an unhealthy system:

/* Take Mutex to make sure we are not trying to grab a working buffer that some */
/* other application is also trying to grab. */
OsStatus = OS_MutSemTake(CFE_TBL_Global.WorkBufMutex);
/* Make note of any errors but continue and hope for the best */
if (OsStatus != OS_SUCCESS)
{
    CFE_ES_WriteToSysLog("%s: Internal error taking WorkBuf Mutex (Status=%ld)\n", __func__,
                         (long)OsStatus);
}

@skliper
Contributor Author

skliper commented Sep 1, 2021

There are also numerous cases of conditions that test for things that should never happen, with inconsistent responses. One idea from recent discussions is to add an API with a "soft exception" sort of concept, where we capture context and have a configurable response (and a persistent reporting mechanism if applicable). Something like "system tainted" or similar. This could reduce event clutter and provide more consistent reporting.
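
A rough sketch of what such a "soft exception" API might look like, purely as an illustration of the idea. None of these names exist in cFE; the response options, the captured context fields, and the SOFT_EXCEPTION macro are all assumptions:

```c
#include <stdint.h>

/* Hypothetical configurable responses to a "should never happen" condition. */
typedef enum
{
    SOFTEXC_RESPONSE_IGNORE,        /* count it, nothing else */
    SOFTEXC_RESPONSE_TAINT,         /* mark the system as tainted */
    SOFTEXC_RESPONSE_RESET_REQUEST  /* taint and ask a monitor for a reset */
} SoftExc_Response_t;

/* Context captured at the point of failure (illustrative fields only). */
typedef struct
{
    const char *File;
    int         Line;
    int32_t     Status;
} SoftExc_Context_t;

/* Persistent state that could be reported in telemetry. */
typedef struct
{
    SoftExc_Response_t Response;       /* configured response */
    uint32_t           Count;          /* occurrences, saturating */
    int                Tainted;        /* latched "system tainted" flag */
    int                ResetRequested; /* latched reset request */
    SoftExc_Context_t  First;          /* context of the first occurrence */
} SoftExc_State_t;

static void SoftExc_Report(SoftExc_State_t *State, const char *File, int Line, int32_t Status)
{
    /* Keep the first occurrence, since it is usually closest to the root cause. */
    if (State->Count == 0)
    {
        State->First.File   = File;
        State->First.Line   = Line;
        State->First.Status = Status;
    }
    if (State->Count < UINT32_MAX)
    {
        State->Count++;
    }

    switch (State->Response)
    {
        case SOFTEXC_RESPONSE_RESET_REQUEST:
            State->ResetRequested = 1;
            State->Tainted        = 1;
            break;
        case SOFTEXC_RESPONSE_TAINT:
            State->Tainted = 1;
            break;
        case SOFTEXC_RESPONSE_IGNORE:
        default:
            break;
    }
}

/* Convenience wrapper that captures the call site automatically. */
#define SOFT_EXCEPTION(State, Status) SoftExc_Report((State), __FILE__, __LINE__, (Status))
```

A failed mutex take or an ID mismatch would then call SOFT_EXCEPTION(&State, Status) at the failure site, and the response becomes a matter of configuration rather than whatever each call site happened to do.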

One example is the ID match failure case here, where a cleanup action will happen and a message will get reported, but it's not really obvious that something might be seriously broken (similar to cases where mutex actions fail):

if (CFE_ES_AppRecordIsMatch(AppRecPtr, AppId))

@skliper
Contributor Author

skliper commented Jul 22, 2024

I've got a health reporting library prototyped that manages status bits and a counter, which works nicely for HK reporting. If anyone ever wants to help advance this, collaborate, or just get a status update, let me know. The APIs are still maturing, but I'm using it in a real-world use case and the concept seems helpful/useful, especially for automation/autonomy.
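
The prototype's API isn't shown in this thread, so the following is only a guess at the general shape of "status bits plus a counter" feeding an HK payload. All names and field widths are hypothetical:

```c
#include <stdint.h>

/* Hypothetical library-side state: current status bits plus a report counter. */
typedef struct
{
    uint32_t StatusBits;    /* one bit per health condition */
    uint16_t ReportCounter; /* number of set requests, saturating */
} HealthLib_Data_t;

/* Hypothetical slice of an app's HK telemetry payload. */
typedef struct
{
    uint32_t HealthStatus;
    uint16_t HealthCounter;
} ExampleApp_HkPayload_t;

/* Set one or more bits; the counter shows recurrence even if the bit was already set. */
static void HealthLib_Set(HealthLib_Data_t *Data, uint32_t Bits)
{
    Data->StatusBits |= Bits;
    if (Data->ReportCounter < UINT16_MAX)
    {
        Data->ReportCounter++;
    }
}

/* Clear selected bits (e.g. on command); the counter is left untouched. */
static void HealthLib_Clear(HealthLib_Data_t *Data, uint32_t Bits)
{
    Data->StatusBits &= ~Bits;
}

/* Copy the current health state into the HK payload before it is sent. */
static void HealthLib_FillHk(const HealthLib_Data_t *Data, ExampleApp_HkPayload_t *Payload)
{
    Payload->HealthStatus  = Data->StatusBits;
    Payload->HealthCounter = Data->ReportCounter;
}
```

Keeping the counter separate from the bits means an automated monitor can tell "this condition happened once" apart from "this condition keeps happening", which the bits alone can't show.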
