Calculate unhealthy reason (input/output/other) on agent documents #3338

Merged on Mar 18, 2024 (8 commits)

@juliaElastic juliaElastic commented Mar 12, 2024

Elasticsearch PR that adds the unhealthy_reason keyword mapping: elastic/elasticsearch#106246

What is the problem this PR solves?

Add unhealthy_reason to the Fleet Server metrics that are published periodically.

How does this PR solve the problem?

Calculate unhealthy_reason from agent.components on each check-in and save it in the agent document.
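
As a rough illustration of the calculation described above, here is a minimal Go sketch. It is not the code from this PR: the `Unit` and `Component` types are simplified, hypothetical stand-ins for the agent.components schema, and treating `FAILED` and `DEGRADED` as the unhealthy statuses is an assumption. The flag names follow the diff snippet quoted later in this conversation.

```go
package main

import "fmt"

// Unit and Component are simplified, hypothetical stand-ins for the
// entries reported in agent.components.
type Unit struct {
	ID     string
	Type   string // "input" or "output"
	Status string
}

type Component struct {
	ID     string
	Status string
	Units  []Unit
}

// isUnhealthy assumes that FAILED and DEGRADED are the unhealthy statuses.
func isUnhealthy(status string) bool {
	return status == "FAILED" || status == "DEGRADED"
}

// calcUnhealthyReason returns "input" and/or "output" when an unhealthy
// component has an unhealthy unit of that type, and "other" when some
// component is unhealthy but no unhealthy input or output unit was found.
func calcUnhealthyReason(components []Component) []string {
	hasUnhealthyInput := false
	hasUnhealthyOutput := false
	hasUnhealthyComponent := false

	for _, c := range components {
		if !isUnhealthy(c.Status) {
			continue
		}
		hasUnhealthyComponent = true
		for _, u := range c.Units {
			if !isUnhealthy(u.Status) {
				continue
			}
			switch u.Type {
			case "input":
				hasUnhealthyInput = true
			case "output":
				hasUnhealthyOutput = true
			}
		}
	}

	var reasons []string
	if hasUnhealthyInput {
		reasons = append(reasons, "input")
	}
	if hasUnhealthyOutput {
		reasons = append(reasons, "output")
	}
	if hasUnhealthyComponent && len(reasons) == 0 {
		reasons = append(reasons, "other")
	}
	return reasons
}

func main() {
	components := []Component{{
		ID:     "endpoint-default",
		Status: "FAILED",
		Units: []Unit{
			{ID: "endpoint-default-input", Type: "input", Status: "FAILED"},
			{ID: "endpoint-default", Type: "output", Status: "FAILED"},
		},
	}}
	fmt.Println(calcUnhealthyReason(components)) // prints [input output]
}
```

A healthy agent yields an empty result in this sketch, so nothing would be written for unhealthy_reason in that case.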

How to test this PR locally

  • enroll an agent with docker
  • add endpoint integration, expect an input and output unit error status on the agent doc
  • check that unhealthy_reason is added to the agent doc
GET .fleet-agents/_search
{
  "_source": ["unhealthy_reason"],
  "sort": [
    {
      "last_checkin": {
        "order": "desc"
      }
    }
  ]
}

Example response (truncated):

   "hits": [
      {
        "_index": ".fleet-agents-7",
        "_id": "f7406324-ed6b-4605-b9c2-05be026332b3",
        "_score": null,
        "_source": {
          "unhealthy_reason": [
            "input", "output"
          ]
        },
        "sort": [
          1710250909000
        ]
      },

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/2522

@juliaElastic juliaElastic added the enhancement New feature or request label Mar 12, 2024
@juliaElastic juliaElastic self-assigned this Mar 12, 2024
@juliaElastic juliaElastic changed the title calculate unhealthy_reason on agent doc Calculate unhealthy reason (input/output/other) on agent documents Mar 12, 2024
hasUnhealthyInput := false
hasUnhealthyOutput := false
hasUnhealthyComponent := false
reqComponentsArray, ok := reqComponents.([]interface{})
Contributor Author

I'll look into adding a schema for agent.components instead of all this parsing.

Contributor Author

Added the schema, and it simplifies the code a lot.
It needs more testing, though: I'm not sure what happens if there are other (unmapped) properties in components. We should probably keep them for debugging purposes.

Contributor Author

Tested with a healthy endpoint input and confirmed that extra (unmapped) properties are not removed after adding the schema:

 "components": [
    {
      "id": "endpoint-default",
      "type": "endpoint",
      "status": "HEALTHY",
      "message": "Healthy: communicating with endpoint service",
      "units": [
        {
          "id": "endpoint-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Applied policy {e14510ab-83b8-4c40-af40-519ea5203adf}",
          "payload": {
            "error": {
              "code": 0,
              "message": "Success"
            }
          }
        },

@juliaElastic juliaElastic marked this pull request as ready for review March 12, 2024 15:01
@juliaElastic juliaElastic requested a review from a team as a code owner March 12, 2024 15:01
@juliaElastic
Contributor Author

buildkite run perf-tests

@nchaulet nchaulet self-requested a review March 13, 2024 13:05
}

var outComponents []byte

// Compare the deserialized meta structures and return the bytes to update if different
- if !reflect.DeepEqual(reqComponents, agentComponents) {
+ if !reflect.DeepEqual(reqComponents, agent.Components) {
Member

Do we already have a test verifying that we do not update when the components are unchanged?

Contributor Author

Probably not. I'll take a look.

Contributor Author

Added tests that cover both cases: components equal and components not equal.
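
For illustration, the "only update when different" behavior discussed in this thread could be sketched in Go as follows. This is a simplified sketch, not the fleet-server implementation: `componentsToUpdate` and the `Component`/`Unit` types are hypothetical names, and the only parts taken from the diff are the `reflect.DeepEqual` comparison and the idea of returning bytes to write when the values differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// Unit and Component are hypothetical, simplified component types.
type Unit struct {
	ID     string `json:"id"`
	Type   string `json:"type"`
	Status string `json:"status"`
}

type Component struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Units  []Unit `json:"units,omitempty"`
}

// componentsToUpdate returns the serialized request components when they
// differ from the stored ones, and nil when no update is needed.
func componentsToUpdate(reqComponents, stored []Component) ([]byte, error) {
	if reflect.DeepEqual(reqComponents, stored) {
		return nil, nil // identical: skip the write
	}
	return json.Marshal(reqComponents)
}

func main() {
	stored := []Component{{ID: "endpoint-default", Status: "HEALTHY"}}
	same := []Component{{ID: "endpoint-default", Status: "HEALTHY"}}
	changed := []Component{{ID: "endpoint-default", Status: "FAILED"}}

	out, _ := componentsToUpdate(same, stored)
	fmt.Println(out == nil) // prints true

	out, _ = componentsToUpdate(changed, stored)
	fmt.Println(string(out))
}
```

One thing to keep in mind with this approach: `reflect.DeepEqual` treats a nil slice and an empty non-nil slice as different, so tests for the "equal" case should build both sides the same way.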

@juliaElastic juliaElastic requested a review from nchaulet March 13, 2024 15:46
Member

@nchaulet nchaulet left a comment

LGTM 🚀

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

2 New issues
0 Security Hotspots
94.1% Coverage on New Code
0.0% Duplication on New Code

@juliaElastic juliaElastic merged commit 1572693 into elastic:main Mar 18, 2024
8 checks passed
juliaElastic added a commit to elastic/kibana that referenced this pull request Mar 18, 2024
…178605)

## Summary

Closes elastic/ingest-dev#2522

Added `unhealthy_reason` aggregation when querying agent metrics.

The [mapping
change](elastic/elasticsearch#106246) and the
[fleet-server change](elastic/fleet-server#3338)
need to be merged first to verify this end to end.

Steps to verify:
- enroll an agent with docker
- add endpoint integration, expect an input and output unit error status
on the agent doc
- wait a few seconds so that the agent metrics are published
- verify that the agent metrics include `unhealthy_reason`, using the
query below

```
GET metrics-fleet_server.agent_status-default/_search
{
  "_source": ["fleet.agents"]
}

  "hits": [
      {
        "_index": ".ds-metrics-fleet_server.agent_status-default-2024.03.11-000001",
        "_id": "3JdPioUh-9j8DxQrAAABjjclRhU",
        "_score": 1,
        "_source": {
          "fleet": {
            "agents": {
              "enrolled": 12,
              "healthy": 0,
              "inactive": 0,
              "offline": 11,
              "total": 13,
              "unenrolled": 1,
              "unhealthy": 1,
              "updating": 0,
              "upgrading_step": {
                "downloading": 0,
                "extracting": 0,
                "failed": 0,
                "replacing": 0,
                "requested": 0,
                "restarting": 0,
                "rollback": 0,
                "scheduled": 0,
                "watching": 0
              },
              "unhealthy_reason": {
                "input": 1,
                "output": 1
              }
            }
          }
        }
      },
```
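
The `unhealthy_reason` object in the metrics above reads as a per-reason count of agents. The summary says this is computed with an aggregation at query time; the in-memory Go function below is only a hypothetical illustration of the counting semantics, assuming each agent contributes at most once per reason. `countUnhealthyReasons` is not a name from either PR.

```go
package main

import "fmt"

// countUnhealthyReasons tallies how many agents report each reason.
// Each inner slice is one agent's unhealthy_reason values.
func countUnhealthyReasons(agentReasons [][]string) map[string]int {
	counts := make(map[string]int)
	for _, reasons := range agentReasons {
		seen := make(map[string]bool) // count an agent once per reason
		for _, r := range reasons {
			if !seen[r] {
				seen[r] = true
				counts[r]++
			}
		}
	}
	return counts
}

func main() {
	// One unhealthy agent with both an input and an output problem,
	// matching the example document above.
	agents := [][]string{{"input", "output"}}
	fmt.Println(countUnhealthyReasons(agents)) // prints map[input:1 output:1]
}
```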


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios