Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server. #28260

Merged
blakerouse merged 5 commits into elastic:master from http-metrics-in-bootstrap on Oct 14, 2021

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented Oct 5, 2021

What does this PR do?

It allows the HTTP metrics endpoint to run while Fleet Server is in bootstrap mode. It also adds configurable timeouts for waiting on the Elastic Agent daemon and on the Fleet Server bootstrap process; a negative timeout means wait indefinitely.

Why is it important?

Cloud needs this so that it can check the status of the Elastic Agent even when Fleet Server cannot complete the bootstrap process. Cloud will set the timeout to indefinite, and the system will check only every 10 minutes, after the exponential backoff, to see whether it should continue.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Oct 5, 2021
@blakerouse blakerouse self-assigned this Oct 5, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 5, 2021
@mergify
Contributor

mergify bot commented Oct 5, 2021

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 5, 2021
@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Oct 5, 2021
@blakerouse blakerouse added backport-v7.15.0 Automated backport with mergify backport-v7.16.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify labels Oct 5, 2021
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Oct 5, 2021
@blakerouse blakerouse marked this pull request as ready for review October 5, 2021 18:08
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Elastic-Agent)

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Collaborator

elasticmachine commented Oct 5, 2021

💚 Build Succeeded


Build stats

  • Duration: 86 min 52 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

@simitt
Contributor

simitt commented Oct 6, 2021

@andresrc are you ok with backporting this as a fix to 7.15?

Contributor

@simitt simitt left a comment


Generally looks good to me. I also want to test on ECE though, if you could hold back with merging until then.

@blakerouse
Contributor Author

/package

@jlind23
Collaborator

jlind23 commented Oct 7, 2021

@simitt @blakerouse did you have a chance to test it yet?

@simitt
Contributor

simitt commented Oct 11, 2021

I tested and created elastic/fleet-server#763 as a follow-up, because the observed behavior was not quite what was expected and the agent and fleet-server were very noisily logging the same errors.
Also, with the changes in this PR, the agent always returns a fleet-server process with a pid in the /processes response, although the fleet-server is not healthy.

@blakerouse
Contributor Author

I have the fix for elastic/fleet-server#763 here elastic/fleet-server#768. That will provide the behavior we need for this to work properly.

@blakerouse blakerouse force-pushed the http-metrics-in-bootstrap branch from a29e234 to 68631ae Compare October 12, 2021 12:44
@blakerouse
Contributor Author

/package

@mergify
Contributor

mergify bot commented Oct 13, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b http-metrics-in-bootstrap upstream/http-metrics-in-bootstrap
git merge upstream/master
git push upstream http-metrics-in-bootstrap

@simitt
Contributor

simitt commented Oct 13, 2021

I retested with the fleet-server fix, and the agent and fleet-server work as expected now on cloud. The healthcheck endpoint is immediately exposed, the container is considered healthy, while fleet-server is still trying to start up. The agent returns status: STARTING for the fleet-server.

fleet-server still logs roughly every 5 seconds that it is waiting for the policy, but the agent logging is fairly quiet.
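
The behavior described in this comment (the endpoint answers immediately and reports STARTING while the process is still coming up) could be sketched in Go roughly as follows. This is an illustration only, not Elastic Agent's implementation: the type `processState`, the helpers `statusBody` and `statusHandler`, the JSON shape, and the "HEALTHY" state name are all hypothetical; only "STARTING" comes from the discussion above.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// Hypothetical state values; only "STARTING" is named in the discussion.
const (
	stateStarting int32 = iota
	stateHealthy
)

// processState tracks the state of a supervised process. The zero value
// is stateStarting, so a freshly created process reports STARTING.
type processState struct{ v int32 }

// set updates the state atomically so HTTP handlers can read it safely.
func (p *processState) set(s int32) { atomic.StoreInt32(&p.v, s) }

// name returns the textual state used in the response body.
func (p *processState) name() string {
	if atomic.LoadInt32(&p.v) == stateHealthy {
		return "HEALTHY"
	}
	return "STARTING"
}

// statusBody renders the JSON payload for the process.
func statusBody(p *processState) []byte {
	// Marshalling a map[string]string cannot fail, so the error is ignored.
	b, _ := json.Marshal(map[string]string{"status": p.name()})
	return b
}

// statusHandler answers with HTTP 200 even while the process is starting,
// so a container health check passes; the body carries the actual state.
func statusHandler(p *processState) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write(statusBody(p))
	}
}
```

Registering the handler before bootstrap begins, and calling `set(stateHealthy)` only once bootstrap completes, gives the "immediately exposed" behavior described here.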

@blakerouse blakerouse merged commit 15366ff into elastic:master Oct 14, 2021
@blakerouse blakerouse deleted the http-metrics-in-bootstrap branch October 14, 2021 13:05
mergify bot pushed a commit that referenced this pull request Oct 14, 2021
…meouts for Fleet Server. (#28260)

* Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server.

* Add changelog.

* Add the persistent agent configuration to the fleet.yml in bootstrap mode.

* Fix format issues.

(cherry picked from commit 15366ff)
blakerouse added a commit that referenced this pull request Oct 14, 2021
…meouts for Fleet Server. (#28260) (#28445)

* Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server.

* Add changelog.

* Add the persistent agent configuration to the fleet.yml in bootstrap mode.

* Fix format issues.

(cherry picked from commit 15366ff)

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
blakerouse added a commit that referenced this pull request Oct 14, 2021
…meouts for Fleet Server. (#28260) (#28444)

* Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server.

* Add changelog.

* Add the persistent agent configuration to the fleet.yml in bootstrap mode.

* Fix format issues.

(cherry picked from commit 15366ff)

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Icedroid pushed a commit to Icedroid/beats that referenced this pull request Nov 1, 2021
…meouts for Fleet Server. (elastic#28260)

* Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server.

* Add changelog.

* Add the persistent agent configuration to the fleet.yml in bootstrap mode.

* Fix format issues.
Labels
backport-v7.15.0 Automated backport with mergify backport-v7.16.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

[elastic-agent] Elastic Agent shuts down when Fleet Server is unhealthy
4 participants