Fleet Usage telemetry extension #145353
Conversation
@elasticmachine merge upstream
@juliaElastic @jlind23 I am wondering if it's safe to collect the whole fleet server input (probably the same question for
```ts
  agent_checkin_status_last_1h: { error: 0, degraded: 0 },
};

export const getAgentData = async (esClient?: ElasticsearchClient): Promise<AgentData> => {
```
It would be a great improvement to write integration tests for these collectors (I think it's doable with a Jest integration test).
Maybe this can be done in a follow-up issue if it's too much work.
I'll look into adding integration tests.
I think I might create a separate PR for the agent logs collector too, since that depends on an ES change.
Added an integration test.
I wonder how we could improve the integration test framework; it seems rather slow to start up an ES instance for every test, and it is slow to test changes locally too.
Raised a separate issue for improvements: #145988
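For reference, a minimal sketch of what such a Jest integration test can look like, assuming Kibana's `createTestServers` helper from `src/core/test_helpers/kbn_server` and a hypothetical entry point into the collector (the real test in this PR may differ):

```ts
import { createTestServers } from 'src/core/test_helpers/kbn_server';

describe('fleet usage telemetry', () => {
  let esServer: { stop: () => Promise<void> };

  beforeAll(async () => {
    // Spins up a real ES instance; this is the slow part discussed above.
    const servers = createTestServers({
      adjustTimeout: (timeout) => jest.setTimeout(timeout),
    });
    esServer = await servers.startES();
  });

  afterAll(async () => {
    await esServer.stop();
  });

  it('collects agent data from .fleet-agents', async () => {
    // Hypothetical steps: seed a .fleet-agents document via the ES client,
    // then invoke the collector and assert on the reported fields, e.g.
    // const usage = await fetchUsage(esClient);
    // expect(usage.agent_versions).toEqual(['8.6.0']);
  });
});
```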
Pinging @elastic/fleet (Team:Fleet)
```ts
    },
  },
});
{ signal: abortController.signal }
```
The abortController helps to terminate queries if the 1m task timeout is reached. I didn't pass it to all ES queries yet, as that requires changes in many files.
Currently the task takes about 300ms to run on average; I'll keep an eye on it after adding the agent logs queries.
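For illustration, a minimal sketch of the pattern, assuming Kibana core's `ElasticsearchClient` type and the `@elastic/elasticsearch` client's per-request options, which accept an `AbortSignal`; the function and index names are illustrative:

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

const TASK_TIMEOUT_MS = 60 * 1000; // the 1m task timeout mentioned above

export const getAgentStatusData = async (esClient: ElasticsearchClient) => {
  const abortController = new AbortController();
  // Abort in-flight ES queries once the task timeout is reached.
  const timer = setTimeout(() => abortController.abort(), TASK_TIMEOUT_MS);
  try {
    return await esClient.search(
      { index: '.fleet-agents', size: 0 },
      { signal: abortController.signal }
    );
  } finally {
    clearTimeout(timer);
  }
};
```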
Thanks for being mindful about performance here. As I understand it, things like background tasks, telemetry collection, etc. will eventually run on Kibana's separate "service" process, so heavy operations like this will eventually be optimized so that they aren't taking time or resources away from the main Kibana server/UI. Still, it's good to make sure we're not running a bunch of long-running queries here.
@elasticmachine merge upstream
@nchaulet might include some sensitive data indeed. Host and port values can definitely be removed without issues. I guess for the others it is going to be a case-by-case decision.
Updated to remove host/port information from the fleet server config, and also removed fleet server hosts (as they didn't contain anything useful other than host and port). I suppose it doesn't add much value to send config IDs to telemetry.
Yes, this looks correct.
I think a total checkin failure count is acceptable here. The granular timestamps/metrics are probably just noise for telemetry purposes.
Yes, let's only "alert" on non-healthy checkins.
Not sure on @jlind23's original ask here. I think agent logs are most useful here, but I'm sure fleet server logs could be helpful as well. We'd need to update the ES permissions to include
This does seem redundant, and I'd rather stick to the previously defined checkin failure field and not risk duplication.
I found that
A few notes about extracting data from the logs:
Yes, this is already in a PR: elastic/elasticsearch#91701
Yes, that is why I added the APM span and task timeout, to keep performance in check.
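A hedged sketch of that APM span pattern with `elastic-apm-node`; the span name and wrapper function are illustrative, not the PR's actual code:

```ts
import apm from 'elastic-apm-node';

// Wrap the telemetry collection in an APM span so its duration shows up in traces.
async function collectWithSpan<T>(name: string, collect: () => Promise<T>): Promise<T> {
  const span = apm.startSpan(name); // returns null if there is no active transaction
  try {
    return await collect();
  } finally {
    span?.end();
  }
}

// Hypothetical usage:
// await collectWithSpan('fleet-usage-telemetry', () => getAgentData(esClient));
```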
I'll make these changes. BTW, I planned to query the last 1h since we are publishing the telemetry hourly.
I think I tried this and it didn't work for the
```ts
  kuery: `${PACKAGE_POLICY_SAVED_OBJECT_TYPE}.package.name:fleet_server`,
});

const getInputConfig = (item: any) => {
  let config = (item.inputs[0] ?? {}).compiled_input;
```
I think there is still some sensitive info here; we could have some path in `ssl`.
What about just whitelisting what we think is interesting here? Maybe just `timeouts` and `limits`?
Refactored with whitelisting.
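A minimal sketch of the whitelisting idea, assuming we keep only `limits`-, `runtime`-, and `timeouts`-related keys from the compiled fleet-server input; the actual allowlist in the PR may differ:

```ts
// Illustrative allowlist; the PR's real list may include other safe keys.
const ALLOWED_PREFIXES = ['limits', 'runtime', 'timeouts'];

const pickAllowedConfig = (config: Record<string, unknown>): Record<string, unknown> =>
  Object.fromEntries(
    Object.entries(config).filter(([key]) =>
      ALLOWED_PREFIXES.some((prefix) => key === prefix || key.startsWith(`${prefix}.`))
    )
  );
```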
LGTM! Thanks for adding that integration test, it really helps to understand what is collected! And I agree we probably do not need to have an ES instance started for all the tests; maybe we can move some of these tests to the FTR config.
💛 Build succeeded, but was flaky

Failed CI Steps / Test Failures

Metrics [docs]: Unknown metric groups (ESLint disabled in files, ESLint disabled line counts, Total ESLint disabled count)

History
To update your PR or re-run it, just comment with:
## Summary

Closes elastic/ingest-dev#1261

Added a snippet to the telemetry that I added for each requirement. Please review and let me know if any changes are needed. Also asked a few questions below. @jlind23 @kpollich

Item 6 is blocked by an [elasticsearch change](elastic/elasticsearch#91701) to give kibana_system the missing privilege to read logs-elastic_agent* indices.

Took inspiration for task versioning from https://github.com/elastic/kibana/pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186

- [x] 1. Elastic Agent versions

Versions of all the Elastic Agent running: `agent.version` field on `.fleet-agents` documents

```
"agent_versions": [
  "8.6.0"
],
```

- [x] 2. Fleet server configuration

Think we can query for `.fleet-policies` where some `input` has `type: 'fleet-server'` for this, as well as use the `Fleet Server Hosts` settings that we define via saved objects in Fleet

```
"fleet_server_config": {
  "policies": [
    {
      "input_config": {
        "server": {
          "limits.max_agents": 10000
        },
        "server.runtime": "gc_percent:20"
      }
    }
  ]
}
```

- [x] 3. Number of policies

Count of `.fleet-policies` index

To confirm, did we mean agent policies here?

```
"agent_policies": {
  "count": 7,
```

- [x] 4. Output type contained in those policies

Collecting this from TS logic, querying from the `.fleet-policies` index. The alternative would be to write a Painless script (because the `outputs` are an object with dynamic keys, we can't do an aggregation directly).

```
"agent_policies": {
  "output_types": [
    "elasticsearch"
  ]
}
```

Did we mean to just collect the types here, or any other info? e.g. output urls

- [x] 5. Average number of checkin failures

We only have the most recent checkin status and timestamp on `.fleet-agents`.

Do we mean here to publish the total last checkin failure count? E.g. 3 if 3 agents are in failure checkin status currently. Or do we mean to publish specific info for all agents (`last_checkin_status`, `last_checkin` time, `last_checkin_message`)? Are the only statuses `error` and `degraded` that we want to send?

```
"agent_last_checkin_status": {
  "error": 0,
  "degraded": 0
},
```

- [ ] 6. Top 3 most common errors in the Elastic Agent logs

Do we mean here elastic-agent logs only, or fleet-server logs as well (maybe separately)?

I found an alternative way to query the message field using sampler and categorize_text aggregations:

```
GET logs-elastic_agent*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "log.level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggregations": {
    "message_sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "categories": {
          "categorize_text": { "field": "message", "size": 10 }
        }
      }
    }
  }
}
```

Example response:

```
"aggregations": {
  "message_sample": {
    "doc_count": 112,
    "categories": {
      "buckets": [
        {
          "doc_count": 73,
          "key": "failed to unenroll offline agents",
          "regex": ".*?failed.+?to.+?unenroll.+?offline.+?agents.*?",
          "max_matching_length": 36
        },
        {
          "doc_count": 7,
          "key": """stderr panic close of closed channel n ngoroutine running Stop ngh.neting.cc/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5 \n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go n
```

- [x] 7. Number of checkin failures over the past period of time

I think this is almost the same as #5. The difference would be to report only new failures that happened in the last hour, or to report all agents in a failure state (which would be an increasing number if an agent stays in a failed state). Do we want these 2 separate telemetry fields?

EDIT: removed the last1hr query; instead added a new field to report agents enrolled per policy (top 10). See comments below.

```
"agent_checkin_status": {
  "error": 3,
  "degraded": 0
},
"agents_per_policy": [2, 1000],
```

- [x] 8. Number of Elastic Agent and number of fleet server

This is already there in the existing telemetry:

```
"agents": {
  "total_enrolled": 0,
  "healthy": 0,
  "unhealthy": 0,
  "offline": 0,
  "total_all_statuses": 1,
  "updating": 0
},
"fleet_server": {
  "total_enrolled": 0,
  "healthy": 0,
  "unhealthy": 0,
  "offline": 0,
  "updating": 0,
  "total_all_statuses": 0,
  "num_host_urls": 1
},
```

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

(cherry picked from commit e00e26e)
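To make item 4 above concrete, here is a rough sketch of the TS-side aggregation over `.fleet-policies` documents; the document shape is assumed from the discussion, and field names may differ in the actual code:

```ts
interface FleetPolicyDoc {
  data?: { outputs?: Record<string, { type?: string }> };
}

// Because `outputs` has dynamic keys, collect the distinct output types in
// TypeScript instead of in an ES aggregation.
const collectOutputTypes = (policies: FleetPolicyDoc[]): string[] => {
  const types = new Set<string>();
  for (const policy of policies) {
    for (const output of Object.values(policy.data?.outputs ?? {})) {
      if (output.type) types.add(output.type);
    }
  }
  return [...types].sort();
};
```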
💚 All backports created successfully

Note: Successful backport PRs will be merged automatically after passing CI.

Questions? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)
# Backport

This will backport the following commits from `main` to `8.6`:

- [Fleet Usage telemetry extension (#145353)](#145353)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@juliaElastic @kpollich sorry for the last-minute comment, but I started looking at the stats and one that would be better to have is the number of agents per version.
## Summary

Changed the list of agent versions to an agent count per version in fleet-usages telemetry as requested [here](#145353 (comment)).

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(cherry picked from commit e771fc8)
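For illustration, counting agents per version can be done with a terms aggregation on `agent.version` over `.fleet-agents`; a hedged sketch in which the function name, size cap, and result shape are assumptions:

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

interface VersionAggs {
  versions: { buckets: Array<{ key: string; doc_count: number }> };
}

export const getAgentsPerVersion = async (esClient: ElasticsearchClient) => {
  const response = await esClient.search<unknown, VersionAggs>({
    index: '.fleet-agents',
    size: 0,
    aggs: { versions: { terms: { field: 'agent.version', size: 100 } } },
  });
  // Map each version bucket to a { version, count } telemetry entry.
  return (response.aggregations?.versions.buckets ?? []).map((bucket) => ({
    version: bucket.key,
    count: bucket.doc_count,
  }));
};
```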
# Backport

This will backport the following commits from `main` to `8.6`:

- [changed agent versions to agents per version telemetry (#147164)](#147164)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Co-authored-by: Julia Bardi <julia.bardi@elastic.co>
@jlind23 added agents per version telemetry:
Thanks @juliaElastic, this is a great piece of work 👍🏼