[Fleet] Agent status improvements #75236
Comments
Pinging @elastic/ingest-management (Team:Ingest Management)
@hbharding I'd be interested in your input on this.
Hey @mostlyjason, thanks for putting this together. I think this simplifies a lot. I especially like that we can use these same statuses to communicate deployment status. I created a Whimsical diagram that attempts to capture everything you've described. I organized the diagram so that statuses on the left always supersede statuses on the right if any of the conditions inside are true. For example, if a policy is "unenrolling", it cannot also be in an "unhealthy" or "healthy" state.
I shared this in our meeting yesterday with Endpoint, and there were questions about items inside the "unhealthy" status. "Unhealthy" makes sense when some integrations could have issues while other integrations are running fine. But what if the agent is "online" and has an error that prevents all data from being sent? Shouldn't we elevate this type of status so that it appears more critical? Perhaps it makes sense to introduce a red "error" status like so:
Some questions I have are:
Also, to recap a discussion from yesterday regarding integration errors: we talked about maybe adding a way to "pivot" the agent table so that it is focused on policies. If an agent is unhealthy due to an integration error (Endpoint, for example), it is likely that multiple agents will have the same issue because they use the same policy. On the Fleet page, if we report 200 agents as being "unhealthy", how can the user isolate only the agents that are unhealthy because of an Endpoint integration error?
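To make the "pivot by policy" idea concrete, here is a minimal sketch of grouping unhealthy agents by policy for one failing integration. The `Agent` shape, the `failingIntegrations` field, and the function name are all hypothetical and chosen for illustration; they are not Fleet's actual data model or API.

```typescript
// Hypothetical agent shape for illustration only; not Fleet's real types.
type AgentStatus = "healthy" | "unhealthy" | "offline";

interface Agent {
  id: string;
  policyId: string;
  status: AgentStatus;
  // Assumed field: names of integrations currently reporting an error.
  failingIntegrations: string[];
}

// Group unhealthy agents by policy, keeping only agents whose
// failing integrations include the one the user is filtering on.
function unhealthyByPolicy(
  agents: Agent[],
  integration: string
): Record<string, string[]> {
  const byPolicy: Record<string, string[]> = {};
  for (const agent of agents) {
    if (agent.status !== "unhealthy") continue;
    if (agent.failingIntegrations.indexOf(integration) === -1) continue;
    if (!byPolicy[agent.policyId]) byPolicy[agent.policyId] = [];
    byPolicy[agent.policyId].push(agent.id);
  }
  return byPolicy;
}

// Example data (made up).
const sampleAgents: Agent[] = [
  { id: "agent-1", policyId: "policy-a", status: "unhealthy", failingIntegrations: ["endpoint"] },
  { id: "agent-2", policyId: "policy-a", status: "unhealthy", failingIntegrations: ["endpoint", "system"] },
  { id: "agent-3", policyId: "policy-b", status: "unhealthy", failingIntegrations: ["system"] },
  { id: "agent-4", policyId: "policy-a", status: "healthy", failingIntegrations: [] },
];

const endpointIssues = unhealthyByPolicy(sampleAgents, "endpoint");
```

With this grouping, a table focused on policies could show one row per policy with the count of affected agents, rather than 200 individual "unhealthy" rows.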
You are right, the enrolling status we have now is more of an enrolled status. Should we have an enrolled status for agents between the enrollment and the first check-in?
@michalpristas or @nchaulet I can't find the issue for the Elastic Agent related to this effort. Did you ever create one?
@ph there was no specific issue for that, but this was partially implemented in #84434 (adding the Healthy, Unhealthy, and Updating statuses). There is no per-integration status now, as we postponed this, and the status is still computed by Kibana and not reported by the agent, so we do not have the Updating Policy status.
Just want to note that the goal for the next phase is to expose improved status for inputs in the Agent details page, filtered by integration. That applies to the second user story:
Describe the feature:
Currently, the fleet page shows the status of agents, including whether they are online, offline, or have an error. It also shows whether agents are out of date, and whether they are enrolling or unenrolling. However, there is no way to see which agents have integrations that are reporting errors or are unhealthy. Instead, these agents are reported as online and green, which may be misinterpreted as healthy. We need a better way to indicate to administrators that agents are not running as expected and require attention. Endpoint security reported this use case in #74708.
I'd like to propose refactoring the statuses so that the fleet page shows:
HealthCheck function in every beats. beats#17737, or an upgrade failed and it was rolled back
Additionally, we can indicate when there are manual agent binary updates or agent policy available using a separate flag.
The reason we'd want to provide a summary of statuses on the overview page is to provide a rollup so fleet administrators can determine what is in flux and what requires their attention. Administrators can also filter the list to see just the set of agents requiring their attention, and combine that filter with others to look at a particular agent configuration or integration. Optionally, there could be a way to display sub-status information like "Updating: enrolling".
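As a rough illustration of the rollup described above, the sketch below counts agents per top-level status, builds a display label with an optional sub-status, and filters the agents that need attention within one policy. The status names are taken from this thread, but the `AgentSummary` shape and the choice of which statuses count as "requiring attention" are assumptions made for the example, not a decided design.

```typescript
// Top-level statuses mentioned in this thread; the exact set is an assumption.
type Status = "healthy" | "unhealthy" | "updating" | "offline";

// Hypothetical summary record for one agent.
interface AgentSummary {
  id: string;
  policyId: string;
  status: Status;
  subStatus?: string; // e.g. "enrolling" under "updating"
}

// Count agents per top-level status for the overview page.
function rollup(agents: AgentSummary[]): Record<Status, number> {
  const counts: Record<Status, number> = { healthy: 0, unhealthy: 0, updating: 0, offline: 0 };
  for (const agent of agents) counts[agent.status] += 1;
  return counts;
}

// Render "Updating: enrolling" style labels when a sub-status is present.
function statusLabel(agent: AgentSummary): string {
  const top = agent.status.charAt(0).toUpperCase() + agent.status.slice(1);
  return agent.subStatus ? `${top}: ${agent.subStatus}` : top;
}

// Assumption: "unhealthy" and "offline" are the statuses requiring attention.
function needsAttention(agents: AgentSummary[], policyId?: string): AgentSummary[] {
  return agents.filter(
    (a) =>
      (a.status === "unhealthy" || a.status === "offline") &&
      (policyId === undefined || a.policyId === policyId)
  );
}

// Example data (made up).
const fleet: AgentSummary[] = [
  { id: "a1", policyId: "p1", status: "healthy" },
  { id: "a2", policyId: "p1", status: "updating", subStatus: "enrolling" },
  { id: "a3", policyId: "p1", status: "unhealthy" },
  { id: "a4", policyId: "p2", status: "offline" },
];

const counts = rollup(fleet);
const attentionInP1 = needsAttention(fleet, "p1").map((a) => a.id);
```

The optional `policyId` argument mirrors the idea of combining the attention filter with other filters, such as a particular agent configuration.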
The agent details page will show both the overview status and the finer-grained status information to help users identify the cause of problems. It will provide a way for users to see which integrations are healthy, which are disabled due to user preference or condition, and which have errors or failed a health check along with more information on the reason why. There may be a summary of the health for each integration, and the user can see the activity log for more detail.
This also allows us to communicate the status of deployments using the same statuses, rather than having separate statuses just for deployments (see #72537).
Describe a specific use case for the feature: