
Fleet Usage telemetry extension #145353

Merged: 16 commits into elastic:main on Nov 23, 2022

Conversation

@juliaElastic (Contributor) commented Nov 16, 2022

Summary

Closes https://github.com/elastic/ingest-dev/issues/1261

Added a snippet of the telemetry I added for each requirement. Please review and let me know if any changes are needed.
I also asked a few questions below. @jlind23 @kpollich

  Item 6 is blocked by an elasticsearch change (elastic/elasticsearch#91701) to give kibana_system the missing privilege to read logs-elastic_agent* indices.

Took inspiration for task versioning from https://github.com/elastic/kibana/pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186

  • 1. Elastic Agent versions
    Versions of all the Elastic Agents running: agent.version field on .fleet-agents documents
"agent_versions": [
    "8.6.0"
  ],
  • 2. Fleet server configuration
    Think we can query for .fleet-policies where some input has type: 'fleet-server' for this, as well as use the Fleet Server Hosts settings that we define via saved objects in Fleet
  "fleet_server_config": {
    "policies": [
      {
        "input_config": {
          "server": {
            "limits.max_agents": 10000
          },
          "server.runtime": "gc_percent:20"
        }
      }
    ]
  }
  • 3. Number of policies
    Count of .fleet-policies index

To confirm, did we mean agent policies here?

 "agent_policies": {
    "count": 7,
  • 4. Output type contained in those policies
    Collecting this from ts logic, querying from .fleet-policies index. The alternative would be to write a painless script (because the outputs are an object with dynamic keys, we can't do an aggregation directly).
"agent_policies": {
    "output_types": [
      "elasticsearch"
    ]
  }

Did we mean to just collect the types here, or any other info? e.g. output urls

  • 5. Average number of checkin failures
    We only have the most recent checkin status and timestamp on .fleet-agents.

Do we mean here to publish the total last checkin failure count? E.g. 3 if 3 agents are in failure checkin status currently.
Or do we mean to publish specific info for all agents (last_checkin_status, last_checkin time, last_checkin_message)?
Are the only statuses error and degraded that we want to send?

  "agent_last_checkin_status": {
    "error": 0,
    "degraded": 0
  },
  • 6. Top 3 most common errors in the Elastic Agent logs

Do we mean here elastic-agent logs only, or fleet-server logs as well (maybe separately)?

I found an alternative way to query the message field using sampler and categorize text aggregation:

GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}

Example response:

"aggregations": {
    "message_sample": {
      "doc_count": 112,
      "categories": {
        "buckets": [
          {
            "doc_count": 73,
            "key": "failed to unenroll offline agents",
            "regex": ".*?failed.+?to.+?unenroll.+?offline.+?agents.*?",
            "max_matching_length": 36
          },
          {
            "doc_count": 7,
            "key": """stderr panic close of closed channel n ngoroutine running Stop ngh.neting.cc/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5 \n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go n
  • 7. Number of checkin failure over the past period of time

I think this is almost the same as #5. The difference would be to report only new failures that happened in the last hour, or to report all agents in a failure state (which would be an increasing number if an agent stays in a failed state).
Do we want these 2 separate telemetry fields?

EDIT: removed the last1hr query, and instead added a new field to report agents enrolled per policy (top 10). See comments below; a sketch of this aggregation follows the list.

  "agent_checkin_status": {
    "error": 3,
    "degraded": 0
  },
  "agents_per_policy": [2, 1000],
  • 8. Number of Elastic Agent and number of fleet server

This is already there in the existing telemetry:

  "agents": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "total_all_statuses": 1,
    "updating": 0
  },
  "fleet_server": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "updating": 0,
    "total_all_statuses": 0,
    "num_host_urls": 1
  },
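
For reference, a minimal sketch of how the agents-per-policy counts mentioned in item 7 could be collected with a terms aggregation (assuming `.fleet-agents` documents carry a `policy_id` keyword field; the import path and exact shape of the collector in this PR may differ):

```ts
// Illustrative only: top-10 agent counts per policy from the .fleet-agents index.
// The import path and field names are assumptions, not the exact PR implementation.
import type { ElasticsearchClient } from '@kbn/core/server';

export async function getAgentsPerPolicy(esClient: ElasticsearchClient): Promise<number[]> {
  const res = await esClient.search({
    index: '.fleet-agents',
    size: 0,
    query: { term: { active: true } },
    aggs: {
      agents_per_policy: { terms: { field: 'policy_id', size: 10 } },
    },
  });
  const buckets = ((res.aggregations as any)?.agents_per_policy?.buckets ?? []) as Array<{
    doc_count: number;
  }>;
  return buckets.map((bucket) => bucket.doc_count);
}
```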

Checklist
  • Unit or functional tests were updated or added to match the most common scenarios

@juliaElastic juliaElastic added release_note:skip Skip the PR/issue when compiling release notes v8.7.0 labels Nov 16, 2022
@juliaElastic juliaElastic self-assigned this Nov 16, 2022
@juliaElastic (Contributor Author)

@elasticmachine merge upstream

@juliaElastic (Contributor Author)

@elasticmachine merge upstream

@juliaElastic juliaElastic marked this pull request as ready for review November 18, 2022 14:07
@juliaElastic juliaElastic requested a review from a team as a code owner November 18, 2022 14:07
@nchaulet nchaulet self-requested a review November 18, 2022 14:08
@nchaulet (Member)

@juliaElastic @jlind23 I am wondering if it's safe to collect the whole fleet server input (probably the same question for host_urls values too) as telemetry data; could it contain sensitive info?
Maybe we could sanitize it down to just the info we actually find interesting here.

agent_checkin_status_last_1h: { error: 0, degraded: 0 },
};

export const getAgentData = async (esClient?: ElasticsearchClient): Promise<AgentData> => {
Member

It would be a great improvement to write integration tests for these collectors (I think it's doable with a Jest integration test).
Maybe this can be done in a follow-up issue if it's too much work.

Contributor Author

I'll look into adding integration tests.

I think I might create a separate PR for the agent logs collector too, since that depends on the ES change.

@juliaElastic (Contributor Author) commented Nov 22, 2022

Added an integration test.
I wonder how we could improve the integration test framework; it seems rather slow to start up an ES instance for every test, and it's slow to test changes locally too.
Raised a separate issue for improvements: #145988
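
For illustration, a rough sketch of what a Jest-style integration test for one of these collectors could look like (the real test in this PR uses Kibana's test helpers; the module path, ES URL, seeded document, and asserted shape here are assumptions):

```ts
// Hypothetical Jest integration test: seed one agent document, then assert the
// collector reports it. Assumes a reachable test Elasticsearch node.
import { Client } from '@elastic/elasticsearch';
import { getAgentData } from './fleet_usages_collector'; // hypothetical module path

const esClient = new Client({ node: process.env.TEST_ES_URL ?? 'http://localhost:9200' });

describe('fleet usage collector', () => {
  beforeAll(async () => {
    await esClient.index({
      index: '.fleet-agents',
      refresh: 'wait_for',
      document: { active: true, policy_id: 'policy-1', agent: { version: '8.6.0' } },
    });
  });

  it('reports at least one enrolled agent', async () => {
    const data = await getAgentData(esClient as any);
    expect(data.agents.total_enrolled).toBeGreaterThanOrEqual(1);
  });
});
```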

@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 18, 2022
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

},
},
});
{ signal: abortController.signal }
@juliaElastic (Contributor Author) commented Nov 18, 2022

The abortController helps terminate queries if the 1m task timeout is reached. I didn't pass it to all ES queries yet, as that requires changes in many files.

Currently the task takes about 300ms to run on average; I'll keep an eye on it after adding the agent logs queries.
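
For context, a minimal sketch of how the abort signal and the task timeout could be wired together (the timeout constant and helper name are hypothetical; the `{ signal }` transport option is the one shown in the diff above):

```ts
// Illustrative sketch: abort in-flight ES queries when the task timeout is hit.
// The constant and wiring here are hypothetical, not the exact PR code.
import type { ElasticsearchClient } from '@kbn/core/server';

const TELEMETRY_TASK_TIMEOUT_MS = 60_000; // the 1m task timeout mentioned above

export async function searchWithAbort(esClient: ElasticsearchClient) {
  const abortController = new AbortController();
  const timer = setTimeout(() => abortController.abort(), TELEMETRY_TASK_TIMEOUT_MS);
  try {
    // The second argument is the ES client's transport options bag; passing `signal`
    // lets the client cancel the request when the controller aborts.
    return await esClient.search(
      { index: '.fleet-agents', size: 0, track_total_hits: true },
      { signal: abortController.signal }
    );
  } finally {
    clearTimeout(timer);
  }
}
```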

Member

Thanks for being mindful about performance here. As I understand it, things like background tasks, telemetry collection, etc. will eventually run on Kibana's separate "service" process, so heavy operations like this should eventually be optimized such that they aren't taking time or resources away from the main Kibana server/UI. Still, it's good to make sure we're not running a bunch of long-running queries here.

@juliaElastic (Contributor Author)

@elasticmachine merge upstream

@jlind23 (Contributor) commented Nov 21, 2022

@nchaulet it might include some sensitive data indeed. Host and port values can definitely be removed without issue. I guess for the others it is going to be a case-by-case decision.

@juliaElastic (Contributor Author) commented Nov 21, 2022

might include some sensitive data indeed. Host and port values can definitely be removed without issue. I guess for the others it is going to be a case-by-case decision.

Updated to remove host/port information from the fleet server config, and also removed fleet server hosts (as they didn't contain anything useful other than host/port). I suppose it doesn't add much value to send config ids to telemetry.

@kpollich (Member)

To confirm, did we mean agent policies here?

Yes this looks correct.

Do we mean here to publish the total last checkin failure count? E.g. 3 if 3 agents are in failure checkin status currently.
Or do we mean to publish specific info for all agents (last_checkin_status, last_checkin time, last_checkin_message)?

I think just the total checkin failure counts are acceptable here. The granular timestamps/metrics are probably just noise for telemetry purposes.

Are the only statuses error and degraded that we want to send?

Yes let's only "alert" on non-healthy checkins.

Do we mean here elastic-agent logs only, or fleet-server logs as well (maybe separately)?

Not sure on @jlind23's original ask here. I think agent logs are most useful here, but I'm sure fleet server logs could be helpful as well. We'd need to update the ES permissions to include fleet_server-logs-* as well for this, right?

I think this is almost the same as #5. The difference would be to report new failures happened only in the last hour, or report all agents in failure state. (which would be an increasing number if the agent stays in failed state).
Do we want these 2 separate telemetry fields?

This does seem redundant, and I'd rather stick to the previously defined checkin failure field and not risk duplication.

@juliaElastic (Contributor Author)

Not sure on @jlind23's original ask here. I think agent logs are most useful here, but I'm sure fleet server logs could be helpful as well. We'd need to update the ES permissions to include fleet_server-logs-* as well for this, right?

I found that logs-elastic_agent.fleet_server* pattern contains fleet server logs.

@joshdover (Contributor)

A few notes about extracting data from the logs:

  • We likely need to add privileges for kibana_system to read these data streams
  • We need to be particularly performance conscious and conservative on these searches as it's quite likely to be a lot of data
  • We should limit the time range to the past 24 hrs
  • We should use bool: filter instead of bool: must. Filter provides better performance because its results are cacheable by aggregations and queries.
  • We may get better performance with the random_sampler agg instead of sampler, at the cost of less interesting results. random_sampler is in tech preview, but I think it's ok if we use it for telemetry anyways.

@juliaElastic (Contributor Author) commented Nov 22, 2022

A few notes about extracting data from the logs:

  • We likely need to add privileges for kibana_system to read these data streams

Yes, this is already in a pr: elastic/elasticsearch#91701

  • We need to be particularly performance conscious and conservative on these searches as it's quite likely to be a lot of data

Yes, that is why I added the APM span and task timeout, to keep the performance in check.

  • We should limit the time range to the past 24 hrs
  • We should use bool: filter instead of bool: must. Filter provides better performance because its results are cacheable by aggregations and queries.

I'll make these changes. BTW, I planned to query the last 1h since we are publishing the telemetry hourly.
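
For illustration, a sketch of the log query with that feedback applied, i.e. bool filter instead of must and a bounded range (same index pattern and fields as the query in the description; the exact range and shape used in the PR may differ):

```ts
// Sketch of the log-error categorization query with the review feedback applied:
// bool.filter instead of bool.must, plus a bounded time range.
import type { ElasticsearchClient } from '@kbn/core/server';

export async function getTopErrorCategories(esClient: ElasticsearchClient) {
  return esClient.search({
    index: 'logs-elastic_agent*',
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'log.level': 'error' } },
          { range: { '@timestamp': { gte: 'now-1h' } } }, // telemetry is published hourly
        ],
      },
    },
    aggs: {
      message_sample: {
        sampler: { shard_size: 200 },
        aggs: {
          categories: { categorize_text: { field: 'message', size: 10 } },
        },
      },
    },
  });
}
```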

  • We may get better performance with the random_sampler agg instead of sampler, at the cost of less interesting results. random_sampler is in tech preview, but I think it's ok if we use it for telemetry anyways.

I think I tried this and it didn't work for the message field; I'll check again.
I'm getting this error with the random_sampler agg:

 "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on [message] in [.ds-logs-elastic_agent-default-2022.11.21-000001]. Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [message] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      },

@nchaulet nchaulet self-requested a review November 22, 2022 16:04
kuery: `${PACKAGE_POLICY_SAVED_OBJECT_TYPE}.package.name:fleet_server`,
});
const getInputConfig = (item: any) => {
let config = (item.inputs[0] ?? {}).compiled_input;
Member

I think there is still some sensitive info here, e.g. we could have some paths in ssl. What about just whitelisting what we think is interesting here? Maybe just timeouts and limits?

Contributor Author

Refactored with whitelisting.
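
For illustration, a minimal sketch of the kind of allow-listing discussed here (the listed keys are assumptions based on the config snippet in the description, not the exact implementation):

```ts
// Keep only explicitly allow-listed fleet-server settings (e.g. limits and runtime
// tuning) and drop everything else, such as hosts, ports, and SSL paths.
const ALLOWED_SERVER_KEYS = ['limits.max_agents', 'runtime']; // hypothetical allow-list

const getByPath = (obj: any, path: string) =>
  path.split('.').reduce((acc, key) => (acc == null ? undefined : acc[key]), obj);

export function sanitizeInputConfig(compiledInput: any): Record<string, unknown> {
  const sanitized: Record<string, unknown> = {};
  for (const key of ALLOWED_SERVER_KEYS) {
    const value = getByPath(compiledInput?.server, key);
    if (value !== undefined) {
      sanitized[`server.${key}`] = value;
    }
  }
  return sanitized;
}
```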

@nchaulet (Member) left a comment

LGTM! Thanks for adding that integration test, it really helps to understand what is collected! And I agree we probably do not need to have ES started for all the tests; maybe we can move some of these tests to the FTR config.

@kibana-ci (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #31 / machine learning - anomaly detection create jobs from lens "after each" hook for "can create a single metric job from vis with single layer"
  • [job] [logs] FTR Configs #31 / machine learning - anomaly detection create jobs from lens with wizard can create multi metric job from vis with single layer

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 442 | 448 | +6 |
| total | | | +20 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 67 | 73 | +6 |
| osquery | 110 | 117 | +7 |
| securitySolution | 519 | 525 | +6 |
| total | | | +21 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit e00e26e into elastic:main Nov 23, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 23, 2022
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit e00e26e)
@kibanamachine (Contributor)

💚 All backports created successfully

Status Branch Result
8.6

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 23, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [Fleet Usage telemetry extension
(#145353)](#145353)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-11-23T09:22:20Z","message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngh.neting.cc/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.6.0","v8.7.0"],"number":145353,"url":"https://github.com/elastic/kibana/pull/145353","mergeCommit":{"message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngh.neting.cc/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"8.6","label":"v8.6.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/145353","number":145353,"mergeCommit":{"message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngh.neting.cc/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9"}}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@jlind23 (Contributor) commented Nov 30, 2022

@juliaElastic @kpollich sorry for the last-minute comment, but I started looking at the stats and one that would be better to have is the number of agents per version.
We now have the number of agents and which versions are running, but we do not know how many Elastic Agents are running per version.
@juliaElastic would that be feasible?
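
For what it's worth, a minimal sketch of how agent counts per version could be collected with a terms aggregation (field name taken from item 1 in the description; the follow-up PR's actual implementation may differ):

```ts
// Illustrative only: counts enrolled agents per agent.version on .fleet-agents.
import type { ElasticsearchClient } from '@kbn/core/server';

export async function getAgentsPerVersion(
  esClient: ElasticsearchClient
): Promise<Array<{ version: string; count: number }>> {
  const res = await esClient.search({
    index: '.fleet-agents',
    size: 0,
    query: { term: { active: true } },
    aggs: { versions: { terms: { field: 'agent.version', size: 100 } } },
  });
  const buckets = ((res.aggregations as any)?.versions?.buckets ?? []) as Array<{
    key: string;
    doc_count: number;
  }>;
  return buckets.map((bucket) => ({ version: bucket.key, count: bucket.doc_count }));
}
```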

juliaElastic added a commit that referenced this pull request Dec 7, 2022
## Summary

Changed the list of agent versions to agent count per version in
fleet-usages telemetry as requested
[here](#145353 (comment)).


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 7, 2022
(cherry picked from commit e771fc8)
kibanamachine referenced this pull request Dec 7, 2022 (#147169)

# Backport

This will backport the following commits from `main` to `8.6`:
- [changed agent versions to agents per version telemetry
(#147164)](#147164)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-07T11:23:30Z","message":"changed
agent versions to agents per version telemetry (#147164)\n\n##
Summary\r\n\r\nChanged the list of agent versions to agent count per
version in\r\nfleet-usages telemetry as
requested\r\n[here](https://github.com/elastic/kibana/pull/145353#issuecomment-1331783758).\r\n\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"e771fc8e9fa5f551d8692fc8558e5d7e2cbfef79","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147164,"url":"https://github.com/elastic/kibana/pull/147164","mergeCommit":{"message":"changed
agent versions to agents per version telemetry (#147164)\n\n##
Summary\r\n\r\nChanged the list of agent versions to agent count per
version in\r\nfleet-usages telemetry as
requested\r\n[here](https://github.com/elastic/kibana/pull/145353#issuecomment-1331783758).\r\n\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"e771fc8e9fa5f551d8692fc8558e5d7e2cbfef79"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/147164","number":147164,"mergeCommit":{"message":"changed
agent versions to agents per version telemetry (#147164)\n\n##
Summary\r\n\r\nChanged the list of agent versions to agent count per
version in\r\nfleet-usages telemetry as
requested\r\n[here](https://github.com/elastic/kibana/pull/145353#issuecomment-1331783758).\r\n\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios","sha":"e771fc8e9fa5f551d8692fc8558e5d7e2cbfef79"}},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Co-authored-by: Julia Bardi <julia.bardi@elastic.co>
@juliaElastic (Contributor Author)

@jlind23 added agents per version telemetry:
https://stack-telemetry.elastic.dev/goto/6da82a10-79f1-11ed-a03e-a77716e362c6
[screenshot of agents-per-version telemetry dashboard]

@jlind23 (Contributor) commented Dec 12, 2022

Thanks @juliaElastic, this is a great piece of work 👍🏼

Labels: release_note:skip, Team:Fleet, v8.6.0, v8.7.0