Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Implement agent integration health reporting #158826

Merged

Conversation

jillguyonnet
Copy link
Contributor

@jillguyonnet jillguyonnet commented Jun 1, 2023

Summary

Implement agent integration health reporting in Fleet UI.

Closes #154634

Screenshots

These screenshots were taken with an error (invalid config) on the system integration (metrics).

Before

The error affecting the system integration is not visible in the UI. To find it, the user would need to inspect the agent JSON or run the elastic-agent status command.

Screenshot 2023-06-08 at 16 59 31

After

The error affecting the system integration is surfaced in the UI:

Screenshot 2023-06-08 at 14 59 38

For reference, the following screenshots show existing behaviour with the Elastic Defend integration (errors in the policy response):

Screenshot 2023-06-08 at 14 59 52 Screenshot 2023-06-08 at 15 00 05

Steps to reproduce

  1. Enroll an agent in Fleet and add some integrations.
  2. Introduce some failure, e.g. malformed package policy.
  3. The failures should correctly be surfaced in the UI.

Checklist

Delete any items that are not applicable to this PR.

@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@karenzone
Copy link
Contributor

The documentation issue is here: elastic/ingest-docs#209
@jillguyonnet, let's touch base on docs needs, time frames, and how to collaborate. As always, we're looking forward to working with you.

@jillguyonnet jillguyonnet force-pushed the fleet/integration-health-reporting branch from 56f8e5d to f29cded Compare June 5, 2023 10:41
@jillguyonnet
Copy link
Contributor Author

@karenzone @zombieFox

FYI I've pasted screenshots of the latest WIP iteration in the PR description above, showing various parts of the integration status tree. Note that the policy response part for Elastic Defend is not part of this change, it is existing behaviour that I am showing for reference. The change affects the health status reported on integration inputs.

Your input will be much appreciated on various points:

  • Currently, only input component units are reported in this tree. The designs suggest outputs could be listed as well.
  • UX for units that have no data; in the above example, in the last screenshot, the system integration has nothing to report for winlog since the agent is running on Linux. The current implementation shows a EuiHealth icon with subdued state (grey colour) and a placeholder message (Not available) which could probably be improved.
  • TheEuiCallOut that reports the failure on the metrics part of the system integration has a title and a body. The body is retrieved from the agent response (that can be seen through the View agent JSON action). I put a provisional title (Failed) which might need to be discussed, improved or removed altogether. Note that there is no CTA button as it is unclear at the moment what the action should be.

cc @jlind23

@zombieFox
Copy link
Contributor

Thanks @jillguyonnet for the demo. Continuing from the call:

  • I just checked and EUI Callout does accept a name to change the icon. I suggest we use the new Error icon now found in EUI icon set.
Screenshot 2023-06-08 at 09 31 26

if (!agent.components) {
return packageErrorUnits;
}
return getInputStatusFromAgent(agent.components, packagePolicy).filter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion for a shorter way with lodash:

hasPackageErrors = agent.components ? some(getInputStatusFromAgent(agent.components, packagePolicy), unit => unit.status === 'DEGRADED' || unit.status === 'FAILED') : false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @juliaElastic - there actually was some leftover unnecessary code there which I've removed.

@jillguyonnet jillguyonnet changed the title [WIP][Fleet] Implement agent integration health reporting [Fleet] Implement agent integration health reporting Jun 8, 2023
@jillguyonnet jillguyonnet added Team:Fleet Team label for Observability Data Collection Fleet team v8.9.0 release_note:enhancement labels Jun 8, 2023
@jillguyonnet jillguyonnet marked this pull request as ready for review June 8, 2023 15:04
@jillguyonnet jillguyonnet requested a review from a team as a code owner June 8, 2023 15:04
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jlind23 jlind23 requested a review from juliaElastic June 12, 2023 05:40
agentComponents: FleetServerAgentComponent[],
packagePolicy: PackagePolicy
): FleetServerAgentComponentUnit[] => {
const re = new RegExp(`(${packagePolicy.id}|${packagePolicy.package?.name})`);
Copy link
Contributor

@juliaElastic juliaElastic Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the regex on package name might return invalid results in case we have a package name that includes the name of another. We could use a stricter regex like ^name/.

@elastic/elastic-agent-control-plane Could you suggest what would be the best way to identify component units that belong to a package/package policy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Elastic Agent itself only understands inputs, not packages. Any information that would allow tying an input back to a package is injected by Fleet.

Looks like you have meta.package and package_policy_id to work with. The input ID has to be unique in the agent policy and looks like it is templated from something like $inputType-$package-$packagePolicyID so that might be a good hint for what to use if you want a unique value (you can probably find the code that generates this ID faster than I can :) )

  - id: system/metrics-system-4f510cb9-2f4e-4b81-8a19-9969abe1c924
    name: system-1
    revision: 1
    type: system/metrics
    use_output: default
    meta:
      package:
        name: system
        version: 1.31.1
    data_stream:
      namespace: default
    package_policy_id: 4f510cb9-2f4e-4b81-8a19-9969abe1c924

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this @juliaElastic - this is indeed one of the critical pieces of logic.
Interestingly, the yaml @cmacknz pasted above has a FullAgentPolicy type which we don't have access without an extra API call. While it would be easier (allowing us to simply retrieve units by input id), I'm not sure it's worth it. I have made a change where we only match the unit id against the package policy id; this means we only retrieve inputs, not outputs. Since we only surface the state of inputs, this should be sufficient for the present requirements.
Please let me know your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to keep it simple and match on package policy id that works on inputs. We can come back to this later when we add support for outputs.

@juliaElastic
Copy link
Contributor

Looks good overall, asked a question about the regex used to find the package units.

@jlind23
Copy link
Contributor

jlind23 commented Jun 14, 2023

@zombieFox @karenzone we didn't get any feedback from the demo Jill gave so I take it as an approval and we will then move forward with the current state of this PR.
Thanks for your help

@jillguyonnet jillguyonnet force-pushed the fleet/integration-health-reporting branch from 081b9a9 to f413296 Compare June 14, 2023 09:41
@jillguyonnet
Copy link
Contributor Author

@juliaElastic I've pushed some changes addressing your review 🙏

Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
fleet 813 817 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
fleet 970.1KB 973.3KB +3.2KB
Unknown metric groups

ESLint disabled line counts

id before after diff
enterpriseSearch 13 15 +2
securitySolution 409 413 +4
total +6

Total ESLint disabled count

id before after diff
enterpriseSearch 14 16 +2
securitySolution 492 496 +4
total +6

History

  • 💛 Build #133744 was flaky 081b9a92ea44f32d7821643356b78808a54bd103
  • 💛 Build #133685 was flaky 73053752306cf12f36736466040d262647f022d8
  • 💔 Build #133320 failed 2f662c9723953e3cc685ebf8465c510de9688ca8
  • 💛 Build #132984 was flaky 6ffa626223f59decfb2defcfe7bd75e9f1697643
  • 💔 Build #132619 failed 70743f341f568f4ffec00a3dc3fe2f3a158cdaa4
  • 💔 Build #132606 failed d8e75d45e7ba8b34cb7e915db0c11c72848f7bde

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@jillguyonnet jillguyonnet merged commit 0d6657f into elastic:main Jun 14, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 14, 2023
@jillguyonnet jillguyonnet deleted the fleet/integration-health-reporting branch June 14, 2023 12:55
@jillguyonnet jillguyonnet added the QA:Ready for Testing Code is merged and ready for QA to validate label Jun 14, 2023
jillguyonnet added a commit that referenced this pull request Jun 22, 2023
## Summary

Fix a bug in the health reporting UI change in
#158826 where the page breaks when
the agent has no components.

Closes #159975

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting QA:Ready for Testing Code is merged and ready for QA to validate release_note:enhancement Team:Fleet Team label for Observability Data Collection Fleet team v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Fleet] Implement per-integration health reporting
10 participants