Fix issue where its possible for a component to receive a unit without a config #2138
On first pass this looks good. I will continue tomorrow; I haven't run the tests or installed the agent yet to double-check that this works.
	require.NoError(t, err)
}

func TestManager_FakeInput_KeepsRestarting(t *testing.T) {
Did this or the other test ever reproduce the problem? What about if you reverted my ClearPendingCheckin change that didn't completely fix the problem? Does it catch it then or not at all?
Do the tests catch the issue with no fixes applied at all? (is what I was trying to ask above)
I just tried it with your change removed and sadly these did not show the symptoms either. I wanted to keep the tests in the change because overall I thought they still exercised the code path well.
Agreed I like them too, maybe if we ran them for long enough they'd catch it.
Ran the tests and confirmed the agent can install and enroll with Fleet.
It is difficult to tell if the issue is completely gone (I thought it was the first time) but these changes look like they should prevent it.
Using the latest changes in this branch I added the following change and ran the
So my change still does not 100% fix the issue, because it is possible that the goroutine reads from the channel and what is on the channel is invalid for a restarted component. I am working more on this PR to have a foolproof solution. I want to ensure there is no window in which this can happen. I am also looking at adding some defensive code on the elastic-agent-client side to ensure that, even if it does happen, the client handles the case correctly.
I completely removed all of the fixes, including the first attempt at a fix that this PR replaces, and ran the two unit tests here continuously for ~1 hour without reproducing a failure. This appears particularly challenging to reproduce in an isolated way. I'm glad we managed to recreate it artificially at least.
	if initObserved != nil {
		// the next call to `CheckinExpected` must be from the initial `CheckinObserved` message
		if observed != initObserved {
			// not the initial observed message; we don't send it
Is there any scenario where the initial observed message could be lost or altered before we get here? It is unlikely, but if this were possible we'd be stuck, unable to configure anything.
Is there a client-side timeout on the Checkin request from the component? If there is, that would guard against this by giving it a way to make a second request and reset the initCheckinObserved message.
If the lack of an expected message being sent to the client would cause it to miss subsequent checkins that would also get us out of this theoretical state via the eventual restart of the process.
Yes, it will result in an eventual restart of the process by the command runtime. So this situation is self-healing, but extremely unlikely.
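The guard discussed in this thread can be illustrated with a small sketch. It assumes the check is done by pointer identity, as in the diff fragment above, and that the guard is cleared once the initial message has been answered. The `observed` and `gate` types and the `allow` method are hypothetical names used only for illustration, not the actual types in the runtime.

```go
package main

import "fmt"

// observed is a hypothetical stand-in for a CheckinObserved message.
type observed struct {
	token string
}

// gate remembers the initial observed message of a new checkin stream.
// While initObserved is non-nil, only a response derived from that exact
// message may be sent to the component.
type gate struct {
	initObserved *observed
}

// allow reports whether an expected message derived from obs may be sent.
// Clearing the guard after the first match is an assumption of this sketch.
func (g *gate) allow(obs *observed) bool {
	if g.initObserved != nil {
		if obs != g.initObserved {
			// not the initial observed message; do not send it
			return false
		}
		// the initial message has now been answered; lift the guard
		g.initObserved = nil
	}
	return true
}

func main() {
	first := &observed{token: "initial"}
	later := &observed{token: "later"}
	g := &gate{initObserved: first}

	fmt.Println(g.allow(later)) // blocked until the initial message is answered
	fmt.Println(g.allow(first)) // the initial message passes and lifts the guard
	fmt.Println(g.allow(later)) // subsequent messages pass
}
```

If the initial message were somehow lost, `allow` would keep returning false, which is the stuck state raised in the review; per the reply above, the missed checkins would then lead the command runtime to restart the process.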
The latest round of changes looks good to me, although I would feel better with some targeted tests to ensure this definitely fixes the bug, or that we can't accidentally deadlock the checkin process waiting for an initial message that doesn't match what we expect.
Installed locally and changed the output configuration a few times without issue.
…t a config (#2138)
* Fix issue where checkinExpected channel might have out dated information.
* Run mage fmt.
* Add changelog entry.
* Increase rate lime for failure in test for slow CI runners.
* Cleanups from code review.
* Refactor the design of ensuring an initial expected comes from the observed message.
(cherry picked from commit fefe64f)

…t a config (#2138) (#2151)
* Fix issue where checkinExpected channel might have out dated information.
* Run mage fmt.
* Add changelog entry.
* Increase rate lime for failure in test for slow CI runners.
* Cleanups from code review.
* Refactor the design of ensuring an initial expected comes from the observed message.
(cherry picked from commit fefe64f)
Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
What does this PR do?
Fixes an issue where it is possible for a component to receive a unit without a configuration. This was occurring because the usage of the checkinExpected channel was incorrect.
This fixes the issue by ensuring that the channel is empty on a new Checkin RPC call and by preventing the sending goroutine from reading messages from the channel until the initial checkin has been processed.
This also adds a very defensive path where it only takes the latest CheckinExpected message and sends it, but only after parsing all previous messages to ensure that the latest message does not have a missing configuration that was sent in a previous CheckinExpected.
This also adds more testing.
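The defensive path described above might look roughly like the following sketch. It assumes a simplified message shape in which a nil `Config` means "configuration not included", and it carries forward the most recent configuration seen for each unit. The types `unitExpected` and `checkinExpected` and the function `latestWithConfigs` are hypothetical simplifications, not the real protocol types or the code in this PR.

```go
package main

import "fmt"

// unitExpected is a simplified, hypothetical unit entry; a nil Config
// means the message did not carry a configuration for the unit.
type unitExpected struct {
	ID     string
	Config map[string]string
}

// checkinExpected is a simplified, hypothetical expected message.
type checkinExpected struct {
	Units []unitExpected
}

// latestWithConfigs returns the newest message in pending. Any unit in it
// that lacks a config is given the most recent config seen for that unit
// in the earlier pending messages.
func latestWithConfigs(pending []checkinExpected) (checkinExpected, bool) {
	if len(pending) == 0 {
		return checkinExpected{}, false
	}
	// parse every pending message, remembering the last config per unit
	lastCfg := make(map[string]map[string]string)
	for _, msg := range pending {
		for _, u := range msg.Units {
			if u.Config != nil {
				lastCfg[u.ID] = u.Config
			}
		}
	}
	latest := pending[len(pending)-1]
	out := checkinExpected{Units: make([]unitExpected, len(latest.Units))}
	copy(out.Units, latest.Units)
	for i, u := range out.Units {
		if u.Config == nil {
			out.Units[i].Config = lastCfg[u.ID]
		}
	}
	return out, true
}

func main() {
	pending := []checkinExpected{
		{Units: []unitExpected{{ID: "input-1", Config: map[string]string{"level": "debug"}}}},
		{Units: []unitExpected{{ID: "input-1"}}}, // newest message, config omitted
	}
	msg, ok := latestWithConfigs(pending)
	fmt.Println(ok, msg.Units[0].ID, msg.Units[0].Config["level"])
}
```

Only the newest message is sent, so intermediate states are skipped, but a configuration that appeared in an earlier message is not lost.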
Why is it important?
It was possible for a component to restart with a missing configuration for a unit, which prevents the component from working correctly and causes it to report itself as failed.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
./changelog/fragments using the changelog tool
Related issues