Monitoring should integrate with telemetry like any other plugin #32519
Comments
Pinging @elastic/kibana-platform
Pinging @elastic/stack-monitoring
@pickypg @tsullivan Please take a look at this and let me know if I'm missing important considerations.
This means that any monitoring cluster won't have that data, and it requires every cluster to have opted into telemetry in order to collect the data that previously would have been in Monitoring. Perhaps a simpler solution is to just dramatically slow down the rate at which monitoring collects telemetry (maybe once daily or hourly?) and have the monitoring-telemetry reader allow a wider historical time range for its latest report. Even better, the stack monitoring team could incorporate some of that data into the UI.
This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?
We've talked about this for a long time, but I'm starting to feel like a) the ship is really sailing away on this, and b) sticking a LOT of data in a user's monitoring cluster is not a good forcing function for us to plan how to make the data useful for them.
++
Seriously, I don't think we should have any Kibana telemetry data in Monitoring. The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull. We still need to address the Kibana Metricbeat collector, which reads from the Kibana stats API every 10 seconds and triggers usage collector fetches each time. Perhaps it's time to make that a separate API. However, in the scenario where we take telemetry data out of Monitoring, I'm not sure what the consumer of that API would be.
I'll offer some more thoughts later, but one thing to consider with this ask is that we're migrating from internal collection to external collection of monitoring data, and that is available to users today. As a result, we plan to deprecate and remove the internal shippers in each stack product, including Kibana. We will be removing all of the code around Kibana sending data directly to a monitoring cluster.
Correct. Kibana Monitoring has an all-or-nothing approach to what it collects, so all of this data would be included. That said, if telemetry is not enabled (still opt-in) then it won't send it (and it's wholly a waste).
Perhaps, but this means we'd have less telemetry detail about every Kibana instance except the one that happens to be sending the telemetry, for deployments that have more than one instance -- particularly for anyone who deployed multiple instances before we introduced Kibana Spaces for isolation, since we would not see their data when we live pulled. We still need to pull content from Monitoring whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.
When monitoring is enabled, the data is sent to .monitoring-* every 10 seconds no matter whether there is a remote monitoring cluster or not, and regardless of whether we use Metricbeat, the HTTP exporter, or the local exporter?
Correct. The exporters are only relevant to where that data ends up.
To @chrisronline's point - I think this issue title should be changed. Monitoring in Kibana doesn't really "interact" with telemetry - it just so happens there is a usage service in Kibana that allows clients to register with it, and the service will call fetch on all the collectors every 10 seconds. Whether that happens because Monitoring is enabled and internal collection is kicking off, or not (meaning an external Metricbeat is calling an API every 10 seconds and triggering data collection each time), it ends up being a lot of queries, and that is the issue.
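To make that concrete, here is a minimal TypeScript sketch of the pattern being described, using illustrative names rather than Kibana's actual implementation: clients register collectors with a set, and a bulk fetch runs every collector on the 10-second interval.

```ts
// Illustrative names only; not Kibana's actual collector implementation.
interface Collector {
  type: string;
  fetch(): Promise<Record<string, unknown>>;
}

class CollectorSet {
  private collectors: Collector[] = [];

  register(collector: Collector) {
    this.collectors.push(collector);
  }

  // Every registered collector runs its (potentially expensive) queries each
  // time this is called.
  async bulkFetch() {
    return Promise.all(
      this.collectors.map(async (c) => ({ type: c.type, result: await c.fetch() }))
    );
  }
}

// The problem described above: whether internal Monitoring collection or an
// external Metricbeat hitting the stats API drives it, the bulk fetch fires
// roughly every 10 seconds.
const collectorSet = new CollectorSet();
setInterval(() => void collectorSet.bulkFetch(), 10_000);
```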
Could the telemetry data tell us how common that case is (multiple Kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.
Currently, we only know if the ES cluster has multiple Kibana indices in it (and therefore multiple instances running) when we pull Kibana stats from the Monitoring data. So we can get that number as it is now, but the solution I'm proposing would effectively do away with our ability to do that going forward.
How does everyone feel about removing this collection behavior now and reintroducing it when:
I think it's especially important to remove this now before anything is built on it. The usage architecture needs to be reorganized to support more metrics at the monitoring interval. A longer interval may be worth exploring, but removal sounds more straightforward in the short term and will give us time to take a step back.
@chrisronline @tsullivan @pickypg We have decided to add another interval (probably once per day) rather than removing the collection behavior. Do you have any thoughts on what it would look like to add a separate interval? Would it be a significant restructure?
I'm happy to take on the effort, but I'm not sure what @tsullivan had in mind exactly. Do we just set up a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?
Did we ever have issues with tens of thousands of Kibana browsers consuming the telemetry service concurrently?
It would be a once-per-day collection that fetches Stats collectors as well as Usage collectors. Usage collectors can be filtered out here:
`c => c instanceof UsageCollector`
Here's where other code does that kind of filter for getting a set of only the Usage collectors:
Maybe it doesn't have to be a separate interval; instead, the existing interval could check a cached flag to determine whether it's time to do a full fetch, and only do that once per day.
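A minimal sketch of that idea, under assumed names rather than the real Kibana code: the existing interval keeps fetching Stats collectors every pass, and a cached timestamp acts as the flag that lets Usage collectors through only once per day.

```ts
// Assumed names; not the actual Kibana code.
interface Collector {
  type: string;
  isUsageCollector: boolean; // stand-in for the `c instanceof UsageCollector` check
  fetch(): Promise<Record<string, unknown>>;
}

const FULL_FETCH_INTERVAL_MS = 24 * 60 * 60 * 1000; // once per day
let lastFullFetch = 0; // the "cached flag"

// Runs on the existing 10-second interval: Stats collectors are always
// fetched, Usage collectors only when the daily window has elapsed.
async function collect(collectors: Collector[]) {
  const now = Date.now();
  const fullFetch = now - lastFullFetch >= FULL_FETCH_INTERVAL_MS;
  if (fullFetch) {
    lastFullFetch = now;
  }
  const toFetch = fullFetch ? collectors : collectors.filter((c) => !c.isUsageCollector);
  return Promise.all(toFetch.map((c) => c.fetch()));
}
```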
Okay, I think I have a plan for this and want to run it by everyone. This solution will ensure that Kibana monitoring collection only polls for usage collector data at the same interval as the normal telemetry service.
I have this code working locally and will put up the PR if folks are happy with this approach. |
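Roughly, that plan could look like the sketch below, written with hypothetical names and an assumed daily telemetry cadence rather than the actual PR: usage-collector results are cached, and Monitoring's frequent collection reuses the cached snapshot, only re-running the collectors when the cache is older than the telemetry interval.

```ts
// Hypothetical names; not the actual PR.
interface UsageCollector {
  type: string;
  fetch(): Promise<Record<string, unknown>>;
}

const TELEMETRY_INTERVAL_MS = 24 * 60 * 60 * 1000; // assumed telemetry cadence

let cachedUsage: Record<string, unknown> | null = null;
let cachedAt = 0;

// Monitoring's frequent bulk fetch calls this instead of running the usage
// collectors directly; the expensive fetches only happen when the cached
// snapshot is older than the telemetry interval.
export async function getUsageForMonitoring(usageCollectors: UsageCollector[]) {
  const stale = Date.now() - cachedAt >= TELEMETRY_INTERVAL_MS;
  if (cachedUsage === null || stale) {
    const next: Record<string, unknown> = {};
    for (const collector of usageCollectors) {
      next[collector.type] = await collector.fetch();
    }
    cachedUsage = next;
    cachedAt = Date.now();
  }
  return cachedUsage;
}
```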
Is this why mappings need to be included for the |
@tsullivan Correct. We need some way to create a query that ignores documents that do not contain this field.
Is |
@tsullivan Theoretically, yes. I tried it and was unsuccessful in getting the |
@chrisronline and I'm asking this to avoid having to map any part of the combined data that gets returned from fetching all the collectors. It
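For context on this exchange, a query can skip documents that lack a field by using an `exists` filter, which is presumably why the field needs to appear in the mappings. The sketch below uses a made-up field name, since the actual field under discussion isn't shown in the thread.

```ts
// Placeholder field name; the real field discussed above is not shown here.
const REQUIRED_FIELD = 'usage.collected_at';

// Builds a query that ignores any document missing the given field: `exists`
// matches only documents that contain it.
function ignoreDocsMissingField(field: string) {
  return {
    bool: {
      filter: [{ exists: { field } }],
    },
  };
}

// Passed as the `query` of a search request against the monitoring indices.
export const query = ignoreDocsMissingField(REQUIRED_FIELD);
```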
Pinging @elastic/kibana-stack-services |
Pinging @elastic/pulse (Team:Pulse) |
Closing this issue now that Monitoring and Telemetry are decoupled (#83626). If anyone around here thinks it should remain open, please, feel free to reopen it 🙂 |
Pinging @elastic/kibana-core (Team:Core) |
When monitoring is enabled, it modifies the way telemetry collectors fire in a problematic way. Specifically, it results in all of the various collectors firing frequently (every 10 seconds or so) rather than only when needed to send the telemetry payload. This is excessive, and the performance impact on the Elasticsearch cluster is non-trivial.
This is roughly how telemetry works when monitoring is disabled, which is the desirable behavior all of the time: the collectors fire only when the daily telemetry payload is being assembled.
This is roughly how telemetry works when monitoring is enabled, which is bad: every collector fires on the monitoring interval (every 10 seconds or so), regardless of whether a telemetry payload is due.
Telemetry should be decoupled entirely from monitoring. Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload. The same process as any other plugin.
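As a rough illustration of that proposal, here is a TypeScript sketch in which Monitoring registers a usage collector whose fetch only runs when the telemetry payload is assembled. The `UsageCollectionSetup` interface below is a simplified stand-in, not the exact Kibana API, and the fetched fields are placeholders.

```ts
// Simplified stand-in for Kibana's usage collection service; the real plugin
// API differs in its details across versions.
interface UsageCollectionSetup {
  makeUsageCollector<T>(opts: {
    type: string;
    isReady: () => boolean;
    fetch: () => Promise<T>;
  }): unknown;
  registerCollector(collector: unknown): void;
}

// Placeholder shape for whatever Monitoring would want to report.
interface MonitoringUsage {
  clusterCount: number;
}

export function registerMonitoringCollector(usageCollection: UsageCollectionSetup) {
  const collector = usageCollection.makeUsageCollector<MonitoringUsage>({
    type: 'monitoring',
    isReady: () => true,
    // Runs only when the daily telemetry payload is assembled, pulling what it
    // needs from the monitoring cluster at that point.
    fetch: async () => ({
      clusterCount: 1, // placeholder for data read from the monitoring cluster
    }),
  });
  usageCollection.registerCollector(collector);
}
```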
Note that the monitoring agent would still send its monitoring data to the monitoring cluster on the background interval as always. The only impact to monitoring here is that it is less tightly coupled to telemetry and as a result the telemetry collectors aren't fired alongside the monitoring agent collection.
I consider the current behavior a bug given the detrimental effect it can have on the Elasticsearch cluster.