Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue where its possible for a component to receive a unit without a config #2138

Merged
merged 6 commits into from
Jan 23, 2023

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Jan 18, 2023

What does this PR do?

Fixes an issue where it is possible for a component to receive a unit without a configuration. This was occurring because the usage of the checkinExpected channel was incorrect.

This fixes the issue by ensures that channel is empty on a new Checkin RPC call and prevent the sending goroutine from reading messages from the channel until the initial checkin has been processed.

This also adds a very defensive path where it only takes the latest CheckinExpected message and sends it, but only after parsing all previous messages to ensure that the latest message does not have a missing configuration that was sent in a previous CheckinExpected.

This also adds more testing.

Why is it important?

It was possible for a component to restart with the a missing configuration for a unit, which prevents the component from working correctly and it reports itself as failed.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@blakerouse blakerouse added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team backport-v8.6.0 Automated backport with mergify labels Jan 18, 2023
@blakerouse blakerouse self-assigned this Jan 18, 2023
@blakerouse blakerouse requested a review from a team as a code owner January 18, 2023 19:48
@blakerouse blakerouse requested review from michalpristas and michel-laterman and removed request for a team January 18, 2023 19:48
@elasticmachine
Copy link
Contributor

elasticmachine commented Jan 18, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-01-19T20:50:06.029+0000

  • Duration: 17 min 2 sec

Test stats 🧪

Test Results
Failed 0
Passed 4833
Skipped 13
Total 4846

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@cmacknz cmacknz requested a review from AndersonQ January 18, 2023 20:33
@elasticmachine
Copy link
Contributor

elasticmachine commented Jan 18, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.305% (58/59) 👍
Files 69.268% (142/205) 👍
Classes 69.231% (270/390) 👍
Methods 54.047% (828/1532) 👍
Lines 39.288% (9026/22974) 👍 0.024
Conditionals 100.0% (0/0) 💚

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first pass this looks good. Will continue tomorrow, I haven't run the tests or installed the agent yet to double check this works.

pkg/component/runtime/runtime_comm.go Outdated Show resolved Hide resolved
pkg/component/fake/shipper/listener.go Show resolved Hide resolved
pkg/component/runtime/manager_test.go Outdated Show resolved Hide resolved
pkg/component/runtime/manager_test.go Outdated Show resolved Hide resolved
pkg/component/runtime/manager_test.go Outdated Show resolved Hide resolved
require.NoError(t, err)
}

func TestManager_FakeInput_KeepsRestarting(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this or the other test ever reproduce the problem? What about if you reverted my ClearPendingCheckin change that didn't completely fix the problem? Does it catch it then or not at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the tests catch the issue with no fixes applied at all? (is what I was trying to ask above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried it with your change removed and sadly these did not show the symptoms either. I wanted to keep the tests in the change because overall I thought they still exercised the code path well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed I like them too, maybe if we ran them for long enough they'd catch it.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the tests and confirmed the agent can install and enroll with Fleet.

It is difficult to tell if the issue is completely gone (I thought it was the first time) but these changes look like they should prevent it.

@blakerouse
Copy link
Contributor Author

blakerouse commented Jan 19, 2023

Using the latest changes in this branch I added the following change and ran the TestManager_FakeInput_Restarts test. It causes the exact same issue that we are seeing in production where the Config is nil.

index b30a3721a..7538b4065 100644
--- a/pkg/component/runtime/runtime_comm.go
+++ b/pkg/component/runtime/runtime_comm.go
@@ -268,6 +268,18 @@ func (c *runtimeComm) checkin(server proto.ElasticAgent_CheckinV2Server, init *p
        // the channel as well as prevent the sender channel from reading from `c.checkinExpected` until
        // we have sent the message on `c.checkinObserved`.
        c.latestCheckinExpected(nil)
+
+       c.checkinExpected <- &proto.CheckinExpected{
+               Units: []*proto.UnitExpected{
+                       {
+                               Type:           proto.UnitType_INPUT,
+                               State:          proto.State_HEALTHY,
+                               ConfigStateIdx: 1,
+                               Config:         nil,
+                       },
+               },
+       }
+
        c.checkinObserved <- init
        close(waitExp)

So my change still does not 100% fix the issue, because it is possible that the goroutine reads from the channel and what is on the channel is invalid for a restart component.

I am working more on this PR to have a full proof solution. I want to ensure there is no window of possibility that this happens. I am also looking at adding some defensive code on the elastic-agent-client side to ensure that even if it does happen it handles the case correctly.

@cmacknz
Copy link
Member

cmacknz commented Jan 19, 2023

I completely removed all of the fixes including the first attempt at a fix that this PR replaces and ran the two unit tests here continuously for ~1 hour without reproducing a failure.

This appears particularly challenging to reproduce in an isolated way, I'm glad we managed to recreate it artificially at least.

if initObserved != nil {
// the next call to `CheckinExpected` must be from the initial `CheckinObserved` message
if observed != initObserved {
// not the initial observed message; we don't send it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any scenario where the initial observed message could be lost or altered before we get here? Unlikely but if it this were possible we'd be stuck unable to configure anything.

Is there a client side timeout on the Checkin request from the component? If there is that would guard against this by giving it a way to make a second request and reset the initCheckinObserved message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the lack of an expected message being sent to the client would cause it to miss subsequent checkins that would also get us out of this theoretical state via the eventual restart of the process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it will result in an eventual restart of the process by the command runtime. So this situation is self healing, but extremely unlikely.

@cmacknz
Copy link
Member

cmacknz commented Jan 20, 2023

The latest round of changes looks good to me, although I would feel better with some targeted tests to ensure this definitely fixes the bug, or that we can't accidentally deadlock the checkin process waiting for an initial message that doesn't match what we expect.

@cmacknz
Copy link
Member

cmacknz commented Jan 23, 2023

Installed locally and changed the output configuration a few times without issue.

@blakerouse blakerouse merged commit fefe64f into elastic:main Jan 23, 2023
@blakerouse blakerouse deleted the defensive-checkin-expected branch January 23, 2023 15:00
mergify bot pushed a commit that referenced this pull request Jan 23, 2023
…t a config (#2138)

* Fix issue where checkinExpected channel might have out dated information.

* Run mage fmt.

* Add changelog entry.

* Increase rate lime for failure in test for slow CI runners.

* Cleanups from code review.

* Refactor the design of ensuring an initial expected comes from the observed message.

(cherry picked from commit fefe64f)
blakerouse added a commit that referenced this pull request Jan 23, 2023
…t a config (#2138) (#2151)

* Fix issue where checkinExpected channel might have out dated information.

* Run mage fmt.

* Add changelog entry.

* Increase rate lime for failure in test for slow CI runners.

* Cleanups from code review.

* Refactor the design of ensuring an initial expected comes from the observed message.

(cherry picked from commit fefe64f)

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.6.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beats managed by agent can be left without a configured output
3 participants