
Monitoring should integrate with telemetry like any other plugin #32519

Closed
epixa opened this issue Mar 5, 2019 · 30 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Telemetry Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Monitoring Stack Monitoring team

Comments

@epixa
Contributor

epixa commented Mar 5, 2019

When monitoring is enabled, it modifies the way telemetry collectors fire in a problematic way. Specifically, it results in all of the various collectors firing frequently (every 10 seconds or so) rather than only when needed to send the telemetry payload. This is excessive, and the performance impact on the Elasticsearch cluster is non-trivial.

This is roughly how telemetry works when monitoring is disabled, which is the desirable behavior all of the time:

  1. User loads Kibana in their browser
  2. Kibana determines if telemetry has been sent in the last 24 hours, does nothing if it has
  3. If the last telemetry payload happened more than 24 hours ago, the browser requests telemetry from the server
  4. The server executes all of the registered telemetry collectors, which fetch their data, and sends the combined results back to the browser
  5. Browser sends payload to Elastic telemetry service
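
For illustration, a minimal sketch of the browser-side check in steps 2–5; every name and the endpoint here are hypothetical, not the actual telemetry plugin code:

// Hypothetical sketch only -- names and the endpoint are illustrative.
const DAY_MS = 24 * 60 * 60 * 1000;

async function maybeReportTelemetry(lastReportedAt) {
  // Step 2: if a payload went out within the last 24 hours, do nothing.
  if (lastReportedAt && Date.now() - lastReportedAt < DAY_MS) {
    return;
  }
  // Steps 3-4: ask the Kibana server to run all registered collectors and build the payload.
  const response = await fetch('/api/telemetry/_payload', { method: 'POST' });
  const payload = await response.json();
  // Step 5: the browser ships the combined payload to the Elastic telemetry service.
  await sendToElasticTelemetryService(payload);
}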

This is roughly how telemetry works when monitoring is enabled, which is bad:

  1. In the background on the server, the monitoring agent perpetually collects monitoring information, fires all registered telemetry collectors every 10 seconds, and sends all of this data to the monitoring cluster.
  2. User loads Kibana in their browser
  3. Kibana determines if telemetry has been sent in the last 24 hours, does nothing if it has
  4. If the last telemetry payload happened more than 24 hours ago, the browser requests telemetry from the server
  5. Server retrieves pre-computed telemetry data from monitoring cluster
  6. Browser sends payload to Elastic telemetry service
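
For contrast, the problematic server-side pattern in step 1 looks roughly like this (a sketch with hypothetical names, not the actual monitoring agent code):

// Sketch of step 1 above -- hypothetical names, not the real monitoring agent.
const TEN_SECONDS = 10 * 1000;

setInterval(async () => {
  // Every registered collector fires on each tick, usage/telemetry collectors included,
  // which is why the Elasticsearch cluster gets queried far more often than telemetry needs.
  const documents = await collectorSet.bulkFetch(callCluster);
  await sendToMonitoringCluster(documents);
}, TEN_SECONDS);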

Telemetry should be decoupled entirely from monitoring. Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload. The same process as any other plugin.

Note that the monitoring agent would still send its monitoring data to the monitoring cluster on the background interval as always. The only impact to monitoring here is that it is less tightly coupled to telemetry and as a result the telemetry collectors aren't fired alongside the monitoring agent collection.
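
A hedged sketch of what that could look like, using a makeUsageCollector / registerCollector shape like the one Kibana's usage collection service exposes; the monitoring-specific fetch helper is hypothetical:

// Sketch only: monitoring registers a usage collector like any other plugin.
// fetchMonitoringUsageFromCluster is a hypothetical helper, not existing code.
const monitoringUsageCollector = usageCollection.makeUsageCollector({
  type: 'monitoring',
  isReady: () => true,
  // Runs only when the daily telemetry payload is being assembled, not on a 10-second loop.
  fetch: async () => fetchMonitoringUsageFromCluster(),
});

usageCollection.registerCollector(monitoringUsageCollector);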

I consider the current behavior a bug given the detrimental effect it can have on the Elasticsearch cluster.

@epixa epixa added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Monitoring Stack Monitoring team Feature:Telemetry labels Mar 5, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-platform

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@epixa
Contributor Author

epixa commented Mar 5, 2019

@pickypg @tsullivan Please take a look at this and let me know if I'm missing important considerations

@pickypg
Member

pickypg commented Mar 5, 2019

Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload.

This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.

Perhaps a simpler solution is to just dramatically slow down the rate that monitoring collects telemetry (maybe once daily or hourly?) and have the monitoring-telemetry reader allow for a wider historical time range for its latest report.

Even better, the stack monitoring team could incorporate some of that data into the UI.

@jinmu03
Contributor

jinmu03 commented Mar 6, 2019

@Bamieh

@epixa
Contributor Author

epixa commented Mar 6, 2019

This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.

This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?

@tsullivan
Member

Even better, the stack monitoring team could incorporate some of that data into the UI.

We've talked about this for a long time, but I'm starting to feel like a) the ship is really sailing away on this, and b) sticking a LOT of data in a user's monitoring cluster is not a good forcing function for us to plan how to make the data useful for them.

If you haven't opted in to telemetry for a Kibana install [of a remote cluster], that Kibana install shouldn't be collecting telemetry

++

dramatically slow down the rate that monitoring collects telemetry

Seriously, I don't think we should have any Kibana telemetry data in Monitoring. The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.

We still need to address the Kibana Metricbeat collector, which reads from the Kibana stats API every 10 seconds and triggers usage collector fetches each time. Perhaps it's time to make that a separate API. However, in the scenario where we take telemetry data out of Monitoring, I'm not sure what the consumer of that API would be.

@chrisronline
Contributor

I'll offer some more thoughts later, but one thing to consider with this ask is that we're migrating from internal collection to external collection of monitoring data, and we have this available for users today. As a result, we plan to deprecate and remove the internal shippers in each stack product, including Kibana. We will be removing all code around Kibana sending data directly to a monitoring cluster.

@pickypg
Member

pickypg commented Mar 6, 2019

This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?

Correct. Kibana Monitoring has an all-or-nothing approach to what it collects, so all of this data would be included. That said, if telemetry is not enabled (still opt-in) then it won't send it (and it's wholly a waste).

The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.

Perhaps, but this means that we'd have less telemetry detail about every Kibana instance except the one that happens to be sending the telemetry, for deployments that have more than one instance -- particularly for anyone that deployed multiple instances before we introduced Kibana Spaces for isolation, since we would not see their data when we live pulled.

We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.

@jinmu03
Contributor

jinmu03 commented Mar 7, 2019

When monitoring is enabled, is the data sent to .monitoring-* every 10 seconds regardless of whether there is a remote monitoring cluster, and regardless of whether we use Metricbeat, the HTTP exporter, or the local exporter?

@pickypg
Member

pickypg commented Mar 7, 2019

When monitoring is enabled, is the data sent to .monitoring-* every 10 seconds regardless of whether there is a remote monitoring cluster, and regardless of whether we use Metricbeat, the HTTP exporter, or the local exporter?

Correct. The exporters only determine where that .monitoring-kibana* index gets created / written to, though. What matters to Kibana is: is collection enabled? If yes, then it's going to fetch the data every interval (technically configurable, but defaults to 10s).
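
For reference, the interval in question corresponds to the Kibana monitoring collection settings in kibana.yml (setting names shown as assumed here; verify against the docs for your version):

xpack.monitoring.kibana.collection.enabled: true
# Interval in milliseconds; the 10s default discussed above.
xpack.monitoring.kibana.collection.interval: 10000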

@tsullivan
Member

To @chrisronline's point - I think this issue title should be changed. Monitoring in Kibana doesn't really "interact" with telemetry - it just so happens there is a usage service in Kibana that allows clients to register with it, and the service will call fetch on all the collectors every 10 seconds. Whether that happens because Monitoring is enabled and internal collection is kicking off, or not (meaning an external Metricbeat is calling an API every 10 seconds and triggering data collection each time), it ends up being a lot of queries, and that is the issue.

@peterschretlen
Contributor

We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.

Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.

@tsullivan
Member

Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.

Currently, we only know if the ES cluster has multiple Kibana indices in it (and therefore multiple instances running) when we pull Kibana stats from the Monitoring data. So we can get that number as it is now, but the solution I'm proposing would effectively do away with our ability to do that going forward.

@jbudz
Member

jbudz commented Mar 11, 2019

How does everyone feel about removing this collection behavior now and reintroducing it when:

  1. we have a ui
  2. fetching usage doesn't require a downstream request for each metric

I think it's especially important to remove this now, before anything is built on it. The usage architecture needs to be reorganized to support more metrics at the monitoring interval. A longer interval may be worth exploring, but removal sounds more straightforward short term and will give us time to take a step back.

@jbudz
Member

jbudz commented Mar 19, 2019

@chrisronline @tsullivan @pickypg We have decided to add another interval (probably once per day) rather than removing the collection behavior. Do you have any thoughts on what it would look like to add a separate interval? Would it be a significant restructure?

@chrisronline
Contributor

I'm happy to take on the effort, but I'm not sure what @tsullivan had in mind exactly. Do we just set up a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?

@jinmu03
Contributor

jinmu03 commented Mar 20, 2019

Did we ever have issues with tens of thousands of Kibana browsers hitting the telemetry service concurrently?

@tsullivan
Member

Do we just set up a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?

It would be a once-per-day collection that fetches Stats collectors as well as Usage collectors.

Usage collectors can be filtered out here:

const filterThem = _collectorSet => _collectorSet.getFilteredCollectorSet(c => c.ignoreForInternalUploader !== true);

with the predicate swapped for c => c instanceof UsageCollector.

Here's where other code does that kind of filtering to get a set of only the Usage collectors:

const usageCollectors = this.getFilteredCollectorSet(c => c instanceof UsageCollector);

Maybe it doesn't have to be a separate interval; instead, the existing interval could check a cached flag to determine whether it's time to do a full fetch, and only do that once per day.
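
A minimal sketch of that cached-flag idea, with hypothetical names (the real bulk uploader is structured differently):

// Sketch of the cached-flag idea -- hypothetical names, not the actual bulk uploader code.
const DAY_MS = 24 * 60 * 60 * 1000;
let lastUsageFetch = 0;

async function onCollectionInterval() {
  // Stats collectors keep running every interval so the Monitoring UI stays current.
  let toFetch = collectorSet.getFilteredCollectorSet(c => !(c instanceof UsageCollector));

  // Usage collectors only join the fetch once the daily window has elapsed.
  if (Date.now() - lastUsageFetch >= DAY_MS) {
    toFetch = collectorSet;
    lastUsageFetch = Date.now();
  }

  const documents = await toFetch.bulkFetch(callCluster);
  await uploadToMonitoringCluster(documents);
}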

@tsullivan
Member

tsullivan commented Mar 20, 2019

Trying to do c instanceof UsageCollector from x-pack might be hard, because you'd have to import the class over from core. Maybe usage collectors just need an automatic flag that can be checked, similar to the ignoreForInternalUploader trick:

c => c.ignoreForInternalUploader !== true && c.isUsageCollector !== true

@chrisronline
Contributor

Okay, I think I have a plan for this and want to run it by everyone. This solution will ensure that Kibana monitoring collection only polls for usage collector data at the same interval as the normal telemetry service.

  • Add a flag (isUsageCollector) to each usage collector, as an actual property similar to ignoreForInternalUploader, so monitoring can know it's a usage collector (see why we have to do it this way)

  • Modify the Kibana monitoring collection code to ONLY collect from usage collectors at the same rate as default telemetry (this will be kept in sync by exposing the interval through the xpack_main plugin). However, the monitoring code will still pull from stats collectors at the current interval (defaults to 10s) to ensure the monitoring service and UIs continue to function properly. This means that some .monitoring-kibana-* documents will contain a usage field, and some will not.

  • Update the .monitoring-kibana index template to include mappings for the usage field (it currently does not). Doing this will require a version dance, as we want to ensure that folks don't run into a situation where Kibana assumes this mapping exists but it doesn't yet exist in ES. This shouldn't happen, as we explicitly recommend that the monitoring cluster be at least the same version as all monitored clusters.

  • Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry)

I have this code working locally and will put up the PR if folks are happy with this approach.

@tsullivan
Member

Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data

Is this why mappings need to be included for the usage field?

@chrisronline
Contributor

@tsullivan Correct. We need some way to create a query that ignores documents that do not contain this field.

@tsullivan
Member

Is usage an object field? Could it just be mapped as object? We don't need the inner data to be mapped, correct?

@chrisronline
Contributor

@tsullivan Theoretically, yes. I tried it and was unsuccessful in getting the exists query to work unless I mapped a field inside the object, but it's entirely possible there is some setting that could make this work. It doesn't matter to me either way, but we will need some mapping change.

@tsullivan
Member

@chrisronline an exists: usage query should work if usage is just mapped as type: 'object'. I ran a quick test. LMK if you'd like to take it offline.

I'm asking this to avoid having to map any part of the combined data that gets returned from fetching all the collectors. It
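
For reference, the shapes under discussion look roughly like this (illustrative only; whether the exists query matches depends on at least one sub-field of usage being indexed):

// Mapping addition being discussed for the .monitoring-kibana index template (illustrative).
const usageMapping = {
  usage: { type: 'object' },
};

// Query filter from the plan above: only fetch monitoring documents that actually carry usage data.
const onlyDocsWithUsage = {
  bool: {
    filter: [{ exists: { field: 'usage' } }],
  },
};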

@epixa epixa added Team:Stack Services and removed Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Jul 24, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@elasticmachine
Contributor

Pinging @elastic/pulse (Team:Pulse)

@afharo
Member

afharo commented Dec 10, 2020

Closing this issue now that Monitoring and Telemetry are decoupled (#83626). If anyone around here thinks it should remain open, please, feel free to reopen it 🙂

@afharo afharo closed this as completed Dec 10, 2020
@lukeelmers lukeelmers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)
