
Monitoring should integrate with telemetry like any other plugin #32519

Closed
epixa opened this issue Mar 5, 2019 · 30 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Telemetry Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Monitoring Stack Monitoring team

Comments

@epixa
Contributor

epixa commented Mar 5, 2019

When monitoring is enabled, it modifies the way telemetry collectors fire in a problematic way. Specifically, it results in all of the various collectors firing frequently (every 10 seconds or so) rather than only when needed to send the telemetry payload. This is excessive, and the performance impact on the Elasticsearch cluster is non-trivial.

This is roughly how telemetry works when monitoring is disabled, which is the desirable behavior all of the time:

  1. User loads Kibana in their browser
  2. Kibana determines if telemetry has been sent in the last 24 hours, does nothing if it has
  3. If the last telemetry payload happened more than 24 hours ago, the browser requests telemetry from the server
  4. The server executes all of the registered telemetry collectors, which fetch their data, and sends the combined results back to the browser
  5. Browser sends payload to Elastic telemetry service
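
For illustration, a minimal sketch of the browser-side check in steps 2–5; every name and the endpoint here are hypothetical, not the actual telemetry plugin code:

// Hypothetical sketch only -- names and the endpoint are illustrative.
const DAY_MS = 24 * 60 * 60 * 1000;

async function maybeReportTelemetry(lastReportedAt) {
  // Step 2: if a payload went out within the last 24 hours, do nothing.
  if (lastReportedAt && Date.now() - lastReportedAt < DAY_MS) {
    return;
  }
  // Steps 3-4: ask the Kibana server to run all registered collectors and build the payload.
  const response = await fetch('/api/telemetry/_payload', { method: 'POST' });
  const payload = await response.json();
  // Step 5: the browser ships the combined payload to the Elastic telemetry service.
  await sendToElasticTelemetryService(payload);
}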

This is roughly how telemetry works when monitoring is enabled, which is bad:

  1. In the background on the server, the monitoring agent perpetually collects monitoring information, fires all registered telemetry collectors every 10 seconds, and sends all of this data to the monitoring cluster.
  2. User loads Kibana in their browser
  3. Kibana determines if telemetry has been sent in the last 24 hours, does nothing if it has
  4. If the last telemetry payload happened more than 24 hours ago, the browser requests telemetry from the server
  5. Server retrieves pre-computed telemetry data from monitoring cluster
  6. Browser sends payload to Elastic telemetry service
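
For contrast, the problematic server-side pattern in step 1 looks roughly like this (a sketch with hypothetical names, not the actual monitoring agent code):

// Sketch of step 1 above -- hypothetical names, not the real monitoring agent.
const TEN_SECONDS = 10 * 1000;

setInterval(async () => {
  // Every registered collector fires on each tick, usage/telemetry collectors included,
  // which is why the Elasticsearch cluster gets queried far more often than telemetry needs.
  const documents = await collectorSet.bulkFetch(callCluster);
  await sendToMonitoringCluster(documents);
}, TEN_SECONDS);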

Telemetry should be decoupled entirely from monitoring. Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload. The same process as any other plugin.

Note that the monitoring agent would still send its monitoring data to the monitoring cluster on the background interval as always. The only impact to monitoring here is that it is less tightly coupled to telemetry and as a result the telemetry collectors aren't fired alongside the monitoring agent collection.
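
A hedged sketch of what that could look like, using a makeUsageCollector / registerCollector shape like the one Kibana's usage collection service exposes; the monitoring-specific fetch helper is hypothetical:

// Sketch only: monitoring registers a usage collector like any other plugin.
// fetchMonitoringUsageFromCluster is a hypothetical helper, not existing code.
const monitoringUsageCollector = usageCollection.makeUsageCollector({
  type: 'monitoring',
  isReady: () => true,
  // Runs only when the daily telemetry payload is being assembled, not on a 10-second loop.
  fetch: async () => fetchMonitoringUsageFromCluster(),
});

usageCollection.registerCollector(monitoringUsageCollector);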

I consider the current behavior a bug given the detrimental effect it can have on the Elasticsearch cluster.

@epixa epixa added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Monitoring Stack Monitoring team Feature:Telemetry labels Mar 5, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-platform

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@epixa
Contributor Author

epixa commented Mar 5, 2019

@pickypg @tsullivan Please take a look at this and let me know if I'm missing important considerations

@pickypg
Member

pickypg commented Mar 5, 2019

Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload.

This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.

Perhaps a simpler solution is to just dramatically slow down the rate that monitoring collects telemetry (maybe once daily or hourly?) and have the monitoring-telemetry reader allow for a wider historical time range for its latest report.

Even better, the stack monitoring team could incorporate some of that data into the UI.

@jinmu03
Contributor

jinmu03 commented Mar 6, 2019

@Bamieh

@epixa
Contributor Author

epixa commented Mar 6, 2019

This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.

This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?

@tsullivan
Member

Even better, the stack monitoring team could incorporate some of that data into the UI.

We've talked about this for a long time, but I'm starting to feel like a) the ship is really sailing away on this, and b) sticking a LOT of data in a user's monitoring cluster is not a good forcing function for us to plan how to make the data useful for them.

If you haven't opted in to telemetry for a Kibana install [of a remote cluster], that Kibana install shouldn't be collecting telemetry

++

dramatically slow down the rate that monitoring collects telemetry

Seriously, I don't think we should have any Kibana telemetry data in Monitoring. The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.

We still need to address the Kibana Metricbeat collector, which reads from the Kibana stats API every 10 seconds and triggers usage collector fetches each time. Perhaps it's time to make that a separate API. However, in the scenario where we take telemetry data out of Monitoring, I'm not sure what the consumer of that API would be.

@chrisronline
Contributor

I'll offer some more thoughts later, but one thing to consider with this ask is that we're migrating from internal collection to external collection of monitoring data, and we have this available for users today. As a result, we plan to deprecate and remove the internal shippers in each stack product, including Kibana. We will be removing all code around Kibana sending data directly to a monitoring cluster.

@pickypg
Member

pickypg commented Mar 6, 2019

This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?

Correct. Kibana Monitoring has an all-or-nothing approach to what it collects, so all of this data would be included. That said, if telemetry is not enabled (still opt-in) then it won't send it (and it's wholly a waste).

The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.

Perhaps, but this means that we'd have less telemetry detail about every Kibana instance except the one that happens to be sending the telemetry, for deployments that have more than one instance -- particularly for anyone that deployed multiple instances before we introduced Kibana Spaces for isolation, since we would not see their data when we live pulled.

We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.

@jinmu03
Contributor

jinmu03 commented Mar 7, 2019

When monitoring is enabled, is the data sent to .monitoring-* every 10 seconds regardless of whether there is a remote monitoring cluster, and regardless of whether we use Metricbeat, the HTTP exporter, or the local exporter?

@pickypg
Member

pickypg commented Mar 7, 2019

When monitoring is enabled, is the data sent to .monitoring-* every 10 seconds regardless of whether there is a remote monitoring cluster, and regardless of whether we use Metricbeat, the HTTP exporter, or the local exporter?

Correct. The exporters only determine where that .monitoring-kibana* index gets created / written to, though. What matters to Kibana is: is collection enabled? If yes, then it's going to fetch the data every interval (technically configurable, but defaults to 10s).
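
For reference, the interval in question corresponds to the Kibana monitoring collection settings in kibana.yml (setting names shown as assumed here; verify against the docs for your version):

xpack.monitoring.kibana.collection.enabled: true
# Interval in milliseconds; the 10s default discussed above.
xpack.monitoring.kibana.collection.interval: 10000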

@tsullivan
Member

To @chrisronline's point - I think this issue title should be changed. Monitoring in Kibana doesn't really "interact" with telemetry - it just so happens there is a usage service in Kibana that allows clients to register with it, and the service will call fetch on all the collectors every 10 seconds. Whether that happens because Monitoring is enabled and internal collection is kicking off, or not (meaning an external Metricbeat is calling an API every 10 seconds and triggering data collection each time), it ends up being a lot of queries, and that is the issue.

@peterschretlen
Contributor

We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.

Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.

@tsullivan
Member

Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.

Currently, we only know if the ES cluster has multiple Kibana indices in it (and therefore multiple instances running) when we pull Kibana stats from the Monitoring data. So we can get that number as it is now, but the solution I'm proposing would effectively do away with our ability to do that going forward.

@jbudz
Member

jbudz commented Mar 11, 2019

How does everyone feel about removing this collection behavior now and reintroducing it when:

  1. we have a ui
  2. fetching usage doesn't require a downstream request for each metric

I think it's especially important to remove this now, before anything is built on it. The usage architecture needs to be reorganized to support more metrics at the monitoring interval. A longer interval may be worth exploring, but removal sounds more straightforward short term and will give us time to take a step back.

@jbudz
Member

jbudz commented Mar 19, 2019

@chrisronline @tsullivan @pickypg We have decided to add another interval (probably once per day) rather than removing the collection behavior. Do you have any thoughts on what it would look like to add a separate interval? Would it be a significant restructure?

@chrisronline
Contributor

I'm happy to take on the effort, but I'm not sure what @tsullivan had in mind exactly. Do we just set up a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?

@jinmu03
Contributor

jinmu03 commented Mar 20, 2019

Did we ever have issues with tens of thousands of Kibana browsers hitting the telemetry service concurrently?

@tsullivan
Member

Do we just set up a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?

It would be a once-per-day collection that fetches Stats collectors as well as Usage collectors.

Usage collectors can be filtered out here:

const filterThem = _collectorSet => _collectorSet.getFilteredCollectorSet(c => c.ignoreForInternalUploader !== true);

with the predicate swapped for c => c instanceof UsageCollector.

Here's where other code does that kind of filtering to get a set of only the Usage collectors:

const usageCollectors = this.getFilteredCollectorSet(c => c instanceof UsageCollector);

Maybe it doesn't have to be a separate interval; instead, the existing interval could check a cached flag to determine whether it's time to do a full fetch, and only do that once per day.
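
A minimal sketch of that cached-flag idea, with hypothetical names (the real bulk uploader is structured differently):

// Sketch of the cached-flag idea -- hypothetical names, not the actual bulk uploader code.
const DAY_MS = 24 * 60 * 60 * 1000;
let lastUsageFetch = 0;

async function onCollectionInterval() {
  // Stats collectors keep running every interval so the Monitoring UI stays current.
  let toFetch = collectorSet.getFilteredCollectorSet(c => !(c instanceof UsageCollector));

  // Usage collectors only join the fetch once the daily window has elapsed.
  if (Date.now() - lastUsageFetch >= DAY_MS) {
    toFetch = collectorSet;
    lastUsageFetch = Date.now();
  }

  const documents = await toFetch.bulkFetch(callCluster);
  await uploadToMonitoringCluster(documents);
}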

@tsullivan
Member

tsullivan commented Mar 20, 2019

Trying to do c instanceof UsageCollector from x-pack might be hard, because you'd have to import the class over from core. Maybe usage collectors just need an automatic flag that can be checked, similar to the ignoreForInternalUploader trick:

c => c.ignoreForInternalUploader !== true && c.isUsageCollector !== true

@chrisronline
Contributor

Okay, I think I have a plan for this and want to run it by everyone. This solution will ensure that Kibana monitoring collection only polls for usage collector data at the same interval as the normal telemetry service.

  • Add a flag (isUsageCollector) to each usage collector, as an actual property similar to ignoreForInternalUploader, so monitoring can know it's a usage collector (see why we have to do it this way)

  • Modify the Kibana monitoring collection code to ONLY collect from usage collectors at the same rate as default telemetry (this will be kept in sync by exposing the interval through the xpack_main plugin). However, the monitoring code will still pull from stats collectors at the current interval (defaults to 10s) to ensure the monitoring service and UIs continue to function properly. This means that some .monitoring-kibana-* documents will contain a usage field, and some will not.

  • Update the .monitoring-kibana index template to include mappings for the usage field (it currently does not). Doing this will require a version dance, as we want to ensure that folks don't run into a situation where Kibana assumes this mapping exists but it doesn't yet exist in ES. This shouldn't happen, as we explicitly recommend that the monitoring cluster be at least the same version as all monitored clusters.

  • Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry)

I have this code working locally and will put up the PR if folks are happy with this approach.

@tsullivan
Member

Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data

Is this why mappings need to be included for the usage field?

@chrisronline
Contributor

@tsullivan Correct. We need some way to create a query that ignores documents that do not contain this field.

@tsullivan
Member

Is usage an object field? Could it just be mapped as object? We don't need the inner data to be mapped, correct?

@chrisronline
Contributor

@tsullivan Theoretically, yes. I tried it and was unsuccessful in getting the exists query to work unless I mapped a field inside the object, but it's entirely possible there is some setting that could make this work. It doesn't matter to me either way, but we will need some mapping change.

@tsullivan
Member

@chrisronline an exists: usage query should work if usage is just mapped as type: 'object'. I ran a quick test. LMK if you'd like to take it offline.

I'm asking this to avoid having to map any part of the combined data that gets returned from fetching all the collectors. It
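
For reference, the shapes under discussion look roughly like this (illustrative only; whether the exists query matches depends on at least one sub-field of usage being indexed):

// Mapping addition being discussed for the .monitoring-kibana index template (illustrative).
const usageMapping = {
  usage: { type: 'object' },
};

// Query filter from the plan above: only fetch monitoring documents that actually carry usage data.
const onlyDocsWithUsage = {
  bool: {
    filter: [{ exists: { field: 'usage' } }],
  },
};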

@epixa epixa added Team:Stack Services and removed Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Jul 24, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@elasticmachine
Contributor

Pinging @elastic/pulse (Team:Pulse)

@afharo
Member

afharo commented Dec 10, 2020

Closing this issue now that Monitoring and Telemetry are decoupled (#83626). If anyone around here thinks it should remain open, please, feel free to reopen it 🙂

@afharo afharo closed this as completed Dec 10, 2020
@lukeelmers lukeelmers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)
