[Elastic Agent] Report running processes and their health statuses #2156

Closed
3 tasks
jen-huang opened this issue Jun 23, 2021 · 11 comments
Labels
7.16-candidate, Team:Elastic-Agent

Comments

@jen-huang

This is related to elastic/kibana#75236 and elastic/kibana#99068, both of which are longer-term efforts around enabling more granular status reporting of "integrations" that are running on Elastic Agent. However, Agent has no concept of integrations; it only knows which inputs/processes are running.

Still, reporting that information is useful and would get us closer to our longer-term goals. In the short term, this would enable Endpoint to filter agents by which ones are running Endpoint without doing additional JOIN-like queries.

I'd like to propose that agents:

  • Report what inputs/processes are running
  • Report the health status of each
  • Store the above information in the local_metadata field

One thing to consider when deciding the data structure for storing this information is that, in the future, we will want to allow subprocesses to report their own additional meta information, such as the Endpoint process reporting an "isolated" status.
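
A minimal sketch of one possible shape for this data, not the structure Agent actually uses. The `processes` key, the `health` values, and the helper names are all assumptions made for illustration; only `local_metadata` and the "isolated" status come from the proposal above. It keeps a free-form `meta` field per process so a subprocess can attach its own information later, and shows how agents could be filtered by running process without a JOIN-like query.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ProcessStatus:
    name: str    # process/input name, e.g. "endpoint" (hypothetical naming)
    health: str  # hypothetical values: "healthy", "degraded", "failed"
    # Free-form, process-owned metadata (e.g. Endpoint's "isolated" flag).
    meta: dict[str, Any] = field(default_factory=dict)


def build_local_metadata(processes: list[ProcessStatus]) -> dict[str, Any]:
    """Return a local_metadata-style document keyed by process name."""
    return {
        "processes": {
            p.name: {"health": p.health, "meta": p.meta} for p in processes
        }
    }


def runs_process(local_metadata: dict[str, Any], name: str) -> bool:
    """True if the agent's metadata lists the named process."""
    return name in local_metadata.get("processes", {})


# Hypothetical fleet of two agents.
agents = {
    "agent-1": build_local_metadata([
        ProcessStatus("endpoint", "healthy", {"isolated": True}),
        ProcessStatus("filebeat", "healthy"),
    ]),
    "agent-2": build_local_metadata([ProcessStatus("metricbeat", "degraded")]),
}

# Filter agents by running process using only their own metadata.
endpoint_agents = [
    agent_id for agent_id, md in agents.items() if runs_process(md, "endpoint")
]
```

Keying by process name keeps the lookup to a single field check per agent, at the cost of committing to process names in the stored document.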

@jen-huang jen-huang added the Team:Elastic-Agent label Jun 23, 2021
@elasticmachine
Contributor

Pinging @elastic/agent (Team:Agent)

@mostlyjason

@kevinlog What kind of health status info do you want reported? I saw you have policy response data that seems to indicate whether it's running successfully. I suppose that only covers initialization, not if the endpoint becomes unhealthy later?

@urso

urso commented Jun 24, 2021

@mostlyjason don't we already have another meta-issue regarding status reporting?

@kevinlog

@mostlyjason

> What kind of health status info do you want reported? I saw you have policy response data that seems to indicate whether it's running successfully. I suppose that only covers initialization, not if the endpoint becomes unhealthy later?

Endpoint will periodically update its Policy Response if there are meaningful events that change Endpoint's compliance with how the user configured it, so it could change during its lifecycle.

@ferullo could give more details on when this may happen.

@mostlyjason

@kevinlog Do we need another health status reporting mechanism if we already have policy response status? What additional use cases do you require that are not offered by the policy response status?

@kevinlog

kevinlog commented Jul 6, 2021

@mostlyjason sorry I missed this the first time.

> Do we need another health status reporting mechanism if we already have policy response status? What additional use cases do you require that are not offered by the policy response status?

I don't believe Endpoint needs another mechanism; I just think that Fleet users may want additional insight if a subprocess isn't running correctly. Policy compliance for Endpoint is big, so if that's in a "Failed" state, it would be good to bubble that up to Agent so that it can be reported in the UI. Otherwise, all Agents are "Healthy".

I think we could do this in a generic way so that Integrations have the option to ship a "Success/Failure/Warning" status to let Fleet users know something isn't right. Then they could drill down further to individual Agents or solutions to investigate further.

Let me know if that makes sense
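
A minimal sketch of the "bubble up" idea described above, assuming each integration reports one of the three statuses and the agent takes the worst one. The `Status` enum and `agent_status` function are hypothetical names for illustration, not an existing Fleet or Agent API.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by severity so that max() picks the worst status.
    SUCCESS = 0
    WARNING = 1
    FAILURE = 2


def agent_status(integration_statuses: dict[str, Status]) -> Status:
    """Roll integration statuses up to a single agent-level status.

    An agent with no reporting integrations is treated as SUCCESS here;
    that default is an assumption, not something the thread specifies.
    """
    return max(integration_statuses.values(), default=Status.SUCCESS)


# Hypothetical agent where Endpoint's policy response has failed.
overall = agent_status({"endpoint": Status.FAILURE, "system": Status.SUCCESS})
```

With this rollup, a single failing integration is enough to stop the agent from showing as healthy, and the per-integration map is still there for drilling down.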

@mostlyjason

++ sounds like a good idea to make policy responses a generic feature for all integrations. I haven't seen how it works currently, but conceptually it sounds good because it would provide a more structured error we could show on the agent details page, without the user having to dig through logs. It's also nice to have a uniform behavior if we don't have it already.

++ on having a failure response status put the agent into an unhealthy state so we keep our states consistent. Again, I'm not sure how that bubbles up but it sounds good conceptually.

As a general principle I think we don't expose processes to users directly, but the policy response could contain an aggregate of failures across all processes. We could show this aggregate info on the agent details page without exposing the underlying processes in the schema; exposing them may result in a breaking change for users if we remove or change them in the future.

@jen-huang are you aligned on not exposing processes to users in the schema? How do you see this aligning with policy responses? Would it help to have a formal definition/design step for this issue?
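
A minimal sketch of such an aggregate, assuming per-process reports exist internally and only a process-free summary is exposed. The report fields (`process`, `status`, `message`), the sample messages, and `summarize_policy_response` are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from typing import Any


def summarize_policy_response(reports: list[dict[str, Any]]) -> dict[str, Any]:
    """Collapse per-process reports into an aggregate that names no process."""
    counts = Counter(r["status"] for r in reports)
    failures = [r["message"] for r in reports if r["status"] == "failure"]
    return {"status_counts": dict(counts), "failures": failures}


# Internal, per-process view (hypothetical data); "process" never leaves this list.
reports = [
    {"process": "endpoint", "status": "failure",
     "message": "malware protection failed to start"},
    {"process": "filebeat", "status": "success", "message": ""},
    {"process": "metricbeat", "status": "success", "message": ""},
]

summary = summarize_policy_response(reports)
```

Because the summary carries only counts and messages, the set of underlying processes can change without altering the exposed schema.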

@botelastic

botelastic bot commented Jul 19, 2022

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jul 19, 2022
@botelastic botelastic bot closed this as completed Jan 15, 2023
@jen-huang jen-huang transferred this issue from elastic/beats Jan 23, 2023
@jen-huang jen-huang reopened this Jan 23, 2023
@stale stale bot removed the Stalled label Jan 23, 2023
@jen-huang
Author

@pierrehilbert @nimarezainia Not sure if we have an appropriate meta issue that can supersede this one, so I am reopening for now but feel free to close and redirect.

@pierrehilbert
Contributor

We have this one: https://github.com/elastic/ingest-dev/issues/1367

@jlind23
Contributor

jlind23 commented May 14, 2024

Closing this as done.
cc @ycombinator

@jlind23 jlind23 closed this as completed May 14, 2024

8 participants