-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1073, software bus locking #1092
Fix #1073, software bus locking #1092
Conversation
982db6e
to
c3ecf28
Compare
Note this also fixes #1078 -as it removes the problematic routine entirely. |
Fixes concurrency issues in SB. Will probably follow up with moving this to a module since there's multiple ways to implement the Resource ID functionality. Accessing tables using PipeID will no longer be available to external access. ResourceIDs are no longer recycled and removes risk of PipeID aliasing Update cFS-GroundSystem to reflect changes in structure sizes. @jphickey Create "retroactive" issue to document bugs with event and error messages inconsistencies due to lock handling. Remove lingering unnecessary debug artifacts Link this to the "remove pipename from pipe structure" #287 |
This is excellent work, @jphickey ! I agree with all of your changes. Given the scope of these changes, how should we test and roll this out? SBN touches a fair bit of SB, and I've been trying to bring SBN up to "caelum". I may try this branch against my code but do we plan to merge into main soon? (As an aside, I started at NASA as a contractor for a company called Caelum.) |
BTW- regarding discussion of old versions that are affected by this bug.... I wanted to add one important detail: Although some general race conditions certainly have existed across SB for quite some time, this particular issue (or at the least the specific race condition that was actually causing failures described in #1073) was introduced as part of the SB API refactor that was done a month or two ago - that is in the transition to Of course there were a ton of other race conditions as well, such as accessing the pipe/routing table - but these didn't seem to manifest themselves in a noticeable way. The other race conditions fixed by this PR address problem areas with Creating/Deleting SB pipes and route subscribe/unsubscribe ops. While Bootes probably has problems there if you push it hard enough, most FSW doesn't do repetitive subscribe/unsubscribe operations enough to expose these problems. So in other words - I wouldn't necessarily worry about Bootes for normal steady-state message passing. The specific issue that likely manifested itself as #1073 was introduced more recently. |
Yes SBN is an important use case that I was not really able to test as part of this - I don't really have any test setups that use SBN. I'd be hoping that you (or someone!) can pull this into an SBN setup. Note that aside from the scary-looking refactoring, it actually doesn't change that much in the way of actual logic. It gets rid of some redundant struct members and consolidates access to the structs into a locked section, but tries hard not to actually change the behavior of the way these things actually operated. |
Move certain definitions related to the CFE_ES_ResourceID_t type into the global CFE include files. This introduces two new headers: cfe_resourceid.h (public) cfe_resourceid_internal.h (private to CFE core apps) This allows other CFE core apps, such as SB, to use the CFE_ES_ResourceID_t using the same manipulators.
Significant refactor of many SB API calls to address inconsistencies with respect to locking and unlocking of global data structures. First this updates the definition of CFE_SB_PipeId_t to use the CFE_ES_ResourceID_t base type, and a new ID range. Notably this prevents direct access to the CFE_SB.PipeTbl global, forcing code to go through the proper lookup routine, which should only be done while locked. All API implementations follow the same general pattern: - Initial checks/queries while unlocked - Lock SB global - Lookups and/or modifications to the pipe table/routing info - Unlock SB global - Invoke other subsystems (e.g. OSAL) - Re-lock SB global (if needed) do final update, and unlock again - Send all events All error counters should be updated at the end, while still locked. All event processing is deferred to the end of each function, after all other processing is done.
c3ecf28
to
b17cd1e
Compare
Rebased to "main" branch (2021-01-12 baseline) and resolved conflicts |
CCB:2021-01-13 APPROVED |
Describe the contribution
The root cause of the mysterious message delivery issues was tracked down to inconsistent locking. When running on multi core systems a variety of buffer corruption issues were observed. This fix is really a rework of all CFE SB API implementations toward the goal of more consistent locking patterns to solve this issue.
Because SB APIs make heavy use of event IDs, this presents a challenge because events can only be sent while unlocked. In order to ensure that all events are generated and all counters are incremented consistently, the functions should run through to the end and not return early, and use a "PendingEventID" register to record what event ID should be sent per the current operation, rather than simply sending the event at the time the condition is identified.
The general pattern becomes:
This also employs the
CFE_ES_ResourceID_t
type and related patterns for managing the SB Pipe IDs. In particular this (intentionally) makes it not possible to use this directly an array index, and will break code which directly accessed these items without going through the lookup function.Fixes #985
Fixes #1073
Fixes #1096
Testing performed
Build and sanity check CFE
Run all unit tests
Also Let CFE run for approx 72 hours, and observed no more "invalid msgid" or memory pool errors.
Expected behavior changes
CFE_SB_PipeId_t
type is no longer usable as a direct array index. It is also increased in size from 8 bits to 32 bits. But as a result is now consistent with all other ID types in both behavior and size. As long as the typedef is used and PipeIds are used only as intended,, this should not be noticeable to apps.CFE_SB_PipeId_t
value, hence why it had to be updated because the type is now bigger. The spare bytes are also moved to the end of the struct.Neither of these are expected to be particularly problematic - just stuff to be aware of.
System(s) tested on
Ubuntu 20.04
Additional context
Not sure if the change to
CFE_SB_PipeId_t
size (and the TLM message that contains this type) might require an update to cFS-GroundSystem.Contributor Info - All information REQUIRED for consideration of pull request
Joseph Hickey, Vantage Systems, Inc.
EDIT - Fix #948 - eliminates identified double locks.