Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFE ES "StartupSyncSemaphore" subject to multiple race conditions #71

Closed
skliper opened this issue Sep 30, 2019 · 25 comments
Closed

CFE ES "StartupSyncSemaphore" subject to multiple race conditions #71

skliper opened this issue Sep 30, 2019 · 25 comments
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Sep 30, 2019

The "Startup Sync" mechanisms are based on a binary semaphore, a boolean flag, and a counter.

The handling of these various separate entities leaves several opportunities for race conditions to occur. At a minimum, this could cause "WaitForStarupSync" to pend incorrectly, but could have other more serious side effects (unknown) depending on how the apps are using this.

This is one problem that the EVA team at GRC are experiencing while deploying CFS on the Xilinx Microblaze platform.

@skliper skliper added this to the 6.4.2 milestone Sep 30, 2019
@skliper skliper self-assigned this Sep 30, 2019
@skliper skliper added the bug label Sep 30, 2019
@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Imported from trac issue 40. Created by jphickey on 2015-05-06T15:43:00, last modified: 2016-08-11T15:53:15

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-07 12:11:04:

The fundamental issue here is that the conditions are checked under the CFE_ES global data lock, but when the decision is made to pend on the semaphore, there is no way to atomically release the lock AND pend on the semaphore.

There is an opportunity after releasing the lock but before actually pending on the semaphore for the conditions to change.

Simply put, an OSAL binary semaphore is not going to reliably work here.

Although it is not as desirable, the best way to do this may be to simply poll for it, calling OS_TaskDelay so that the other threads can execute while waiting. It is not ideal, but it will work and will not be subject to the issues of trying to have many tasks pending on the same event.

A correct pend-and-release operation that the startup sync is //trying// to do would require the equivalent of a POSIX condition variable, which OSAL does not expose or have an API for.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-08 13:33:11:

Pushed commit [changeset:2633a3d] for review

This modifies the CFE_ES_WaitForStartupSync() to use a simpler polling loop (with delay) rather than a semaphore. This may cause some extra scheduling delays as the waiting tasks will be performing a delay rather than pending on a semaphore, but it will be reliable. Any such delays will be confined to start up, and typically less than 50ms (the delay time), so it should not be a major problem.

This also adds calls into WaitForStartupSync from the core applications which was not done before. This forces each core app to reach its respective main loop before the next core app will be started, regardless of what the platform has the priority set to. This is really part of the fixes for trac #73, but it did not make sense to split it from this one as they depend on each other.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 15:15:36:

Recommend making the CFE_ES_STARTUP_SYNC_POLL_MSEC macro a configuration parameter. This will provide some flexibility as processors are getting faster.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 15:37:48:

It appears the CFE_ES_ApplicationSyncDelay function will force the applications to delay for CFE_ES_STARTUP_SYNC_POLL_MSEC vs. the time the application specifies when making the call to CFE_ES_WaitForStartupSync. This will make the application XX_STARTUP_SYNC_TIMEOUT configuration parameters obsolete.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-15 15:50:39:

The total delay done by CFE_ES_ApplicationSyncDelay() still depends on the "TimeOutMilliseconds" parameter.

CFE_ES_ApplicationSyncDelay() calls OS_TaskDelay() in a loop, it just uses a shorter interval for polling - this is the CFE_ES_STARTUP_SYNC_POLL_MSEC value. It will poll/delay repeatedly until the XX_STARTUP_SYNC_TIMEOUT is reached.

Note that this does suffer from the delay-extension problem, where scheduling delays are not factored into the delay calculation thereby extending it by some unknown quantity. As this is confined to start-up, I don't expect that will matter in this case.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 15:59:14:

Recommend changing the name of the CFE_ES_SYSTEM_STATE_APP_STARTUP macro to CFE_ES_SYSTEM_STATE_COREAPP_STARTUP or CFE_ES_SYSTEM_STATE_CORE_STARTUP.

This will make the source code more readable without having to reference the comments in es_apps.h. On a fist look at the code the name of this macro was a little confusing as to whether the core applications were being started in this system state, the external applications, or both.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 16:10:24:

Also recommend making the CFE_ES_STARTUP_SCRIPT_TIMEOUT_MSEC a configuration parameter.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 16:22:07:

Replying to [comment:6 jphickey]:

The total delay done by CFE_ES_ApplicationSyncDelay() still depends on the "TimeOutMilliseconds" parameter.

CFE_ES_ApplicationSyncDelay() calls OS_TaskDelay() in a loop, it just uses a shorter interval for polling - this is the CFE_ES_STARTUP_SYNC_POLL_MSEC value. It will poll/delay repeatedly until the XX_STARTUP_SYNC_TIMEOUT is reached.

Note that this does suffer from the delay-extension problem, where scheduling delays are not factored into the delay calculation thereby extending it by some unknown quantity. As this is confined to start-up, I don't expect that will matter in this case.

OK, yes, I see that now. If all are OK with the "crude" timing, it would be worthwhile putting in a ticket to address the TBD in the future.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-15 16:23:58:

What is the rational for placing CFE_CORE_MAX_STARTUP_TIME in the private header file vs. the platform configuration header file?

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-15 16:26:42:

Regarding the CFE_ES_SYSTEM_STATE_APP_STARTUP -> CFE_ES_SYSTEM_STATE_CORE_STARTUP name change - I agree with this, will do that.

For the configuration parameters CFE_ES_STARTUP_SYNC_POLL_MSEC, CFE_CORE_MAX_STARTUP_TIME, and CFE_ES_STARTUP_SCRIPT_TIMEOUT_MSEC, we can make these platform config parameters, but this means that anyone who pulls in this changeset will need to also update their cfe_platform_cfg.h file accordingly or their build will break.

By putting these where they are now, it does not require any updates to cfe_platform_cfg.h

That is the only trade-off. If this is acceptable then I can take these defines out of the current spot and move them all into the platform config.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by dheater on 2015-05-15 17:03:43:

In cfe_es_start.c; It looks to me like CFE_ES_ApplicationSyncDelay will '''always''' return when there is a timeout. Even if you have not achieved the minimum system state. It is called from line 293 and the subsequent lines seem to assume that the core startup has completed although that condition is not confirmed.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-15 17:13:25:

Replying to [comment:12 dheater]:

In cfe_es_start.c; It looks to me like CFE_ES_ApplicationSyncDelay will '''always''' return when there is a timeout. Even if you have not achieved the minimum system state. It is called from line 293 and the subsequent lines seem to assume that the core startup has completed although that condition is not confirmed.

CFE_ES_ApplicationSyncDelay() will return CFE_ES_OPERATION_TIMED_OUT if too much time elapses and the condition is still not met.

In the "new" spots where this is called (cfe_es_start.c lines 293 and 951) it is verified that the function returned CFE_SUCCESS before moving on.

However the existing "CFE_ES_WaitForStartupSync()" API already returned "void" and this API was not modified as part of this change. Therefore this function does not confirm whether the conditions were actually met, it has no way of propagating this information, and existing apps were not handling it.

A logical/possible improvement here would be to make CFE_ES_WaitForStartupSync() return an int32 so it can propagate a timeout error up the call stack. Existing apps that don't check the return value are not any worse-off than they already were.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by dheater on 2015-05-15 17:18:24:

Nevermind. On line 278, you've already set SystemState to CORE_READY. Still maybe not the behavior one would expect from CFE_ES_ApplicationSyncDelay, but shouldn't cause an error in this case.

Replying to [comment:12 dheater]:

In cfe_es_start.c; It looks to me like CFE_ES_ApplicationSyncDelay will '''always''' return when there is a timeout. Even if you have not achieved the minimum system state. It is called from line 293 and the subsequent lines seem to assume that the core startup has completed although that condition is not confirmed.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-15 17:46:58:

Replying to [comment:14 dheater]:

Nevermind. On line 278, you've already set SystemState to CORE_READY. Still maybe not the behavior one would expect from CFE_ES_ApplicationSyncDelay, but shouldn't cause an error in this case.

After re-reviewing the code with all these comments I concur that the "system state" parameter is a bit confusing. The problem/objective here is that we need to wait for slightly different things in an external app vs. a core app for startup sync. The main difference being that core apps can/should enter their own main loop and actually do not need to wait for all //subsequent// core apps to be running. External apps, on the other hand, do need to wait for subsequent external apps e.g. just in case the first app listed in "cfe_es_startup.scr" depends on the last app in the file. That is really the only difference, and it is what the "system state" check is supposed to accomplish.

That being said, I have an idea of how I can possibly simplify this further and make it easier to follow. I will try it and report back.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-15 18:24:33:

Replying to [comment:15 jphickey]:

Replying to [comment:14 dheater]:

Nevermind. On line 278, you've already set SystemState to CORE_READY. Still maybe not the behavior one would expect from CFE_ES_ApplicationSyncDelay, but shouldn't cause an error in this case.
That being said, I have an idea of how I can possibly simplify this further and make it easier to follow. I will try it and report back.

The simplification I had been thinking of would leave some holes so I think the existing version is still better.

The problem would be an app started from a CFE ES {{{CFE_ES_START_APP_CC}}} command.

Theoretically, such a command could be received the instant that CFE_ES starts running its main loop, which would be during the core app init phase, potentially even before e.g. CFE_TIME or CFE_TBL is up and running. This trac doesn't prevent that, but if that app //does// happen to call CFE_ES_WaitForStartupSync() then it will at least cause it to properly wait until all the apps have started, including those in the startup script, before returning.

In practice, such a command would probably have to come through an app like CI, which itself wouldn't be running until all core apps are running.

Possibly we might want to consider "holding off" all CORE app main loops until all CORE apps are up, this would implicitly fill the void I described, it would also simplify things by making the external app sync even more similar to the core app sync.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by acudmore on 2015-05-18 13:15:37:

Approve

I think we should put the defines in the system config file. Would it be worth keeping the defines in the source, but using #ifndef statements, or is that just asking for trouble?

I also agree that while the delay loop is not the best solution, it has minimal impact to a running system, and should be reliable. Being able to tune the delay in the system configuration will help. Some CPUs may support a very small delay.

Does this change require apps to use the WaitForStartupSync call? Do we want to require all apps to use this?

Finally, is there anything here that would break on an SMP system?

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-18 14:47:02:

Replying to [comment:17 acudmore]:

I think we should put the defines in the system config file. Would it be worth keeping the defines in the source, but using #ifndef statements, or is that just asking for trouble?

In past experience, having defines in two places is risky. It is a bigger risk if used for the size of something (like an array) because it can mean a different mix or just order of {{{#include}}} statements can //silently// produce incompatible results with no errors or warnings.

As this is only a timeout, that's probably not much of a risk in this particular case, but I still try to avoid this as a general rule.

Does this change require apps to use the WaitForStartupSync call? Do we want to require all apps to use this?

The usage of WaitForStartupSync for external apps is unchanged, and this change is totally transparent to apps. If apps do not call WaitForStartupSync then it works off the call to RunLoop, which is the same as it worked previously.

That being said, the call to RunLoop could be slightly more efficient if we separated off the "change state to running" part from the rest of it.... but this is very minor, and probably not worth changing every app out there to do this.

Finally, is there anything here that would break on an SMP system?

This is all SMP-friendly. (Actually, SMP compatibility is one of the main reasons I am proposing this counter+delay loop solution).

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jwilmot on 2015-05-19 09:30:28:

Approve. Suggest having a matching CFE_ES_WriteToSysLog call for each state transition. This will also log time. Event message text should also be consistent with state names.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-19 17:15:58:

For those interested, I have pushed commit [changeset:eaa7c63] for further review.

NOTE: This commit is based off the current "master" branch, not the "development" branch

The commit has been rebased/backported to apply to the original cFE 6.4.1 release, and includes the following changes from the CCB discussion:

  • Rename the "APP_STARTUP" state to "CORE_STARTUP"
  • Include a CFE_ES_WriteToSyslog() with each state transition. 2 were added, the other 2 are just modifications to existing syslog calls to include the name of the state.
  • Move the timeout macro definitions to the "cfe_platform_cfg.h" files. These have been added to the sample config files distributed under the "fsw/platform_inc" subdirectory.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2015-05-27 17:42:25:

Approve

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-05-29 13:09:21:

Testing note - additional update in commit [changeset:ac557bef]

I have been testing with this trac in place and there is one discovery I made that I felt was significant enough to warrant an additional update. In the particular test case, an application (intentionally) misbehaved and called neither {{{CFE_ES_RunLoop}}} nor {{{CFE_ES_WaitForStartupSync}}} as part of it's main function.

This (correctly) caused the startup sync in the main thread to wait for the timeout to occur before transitioning to the OPERATIONAL state.

However, the issue was that all of the CORE applications were also held off, in addition to the other external applications that were waiting for startup sync. Everything got released and worked normally after the timeout expired and the system entered the OPERATIONAL state.

The problem is: during the time that the applications were held in a sync delay, messages from child threads (or other apps that did not need a sync delay) built up in the queues to the CORE applications, these were supposed to be running but were not. Message limit/queue overflow errors eventually occurred, followed by a flood of activity once the tasks were released.

The referenced commit fixes this problem.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sduran on 2015-06-01 11:09:20:

Approve.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by wmoleski on 2015-06-16 13:36:38:

cFE 6.4.2.0 was tested using an MCP750 running vxworks 6.4. All build verification tests normally executed on a cFE release were successfully executed. An additional test was performed in an attempt to verify that the race conditions identified did not appear. The UART (minicom) output for the test is attached to trac #81.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sstrege on 2016-08-11 15:53:04:

Merged to development on Wed May 20 09:39:24 2015 -0400

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant