[Alerts] Performance benchmarks #40264
Comments
Pinging @elastic/kibana-stack-services
see PR #40291
Since this issue was last updated, the Kibana team has started doing some perf/load testing themselves. We should probably build on what they've done. For more info, see issue #73189 (comment)
Some additional thoughts. We should aim to be able to run a manually launched but otherwise automated set of tests on cloud that:
There are a ton of knobs and dials, but given the combinatorial explosion, we should start small :-)

- I've lately been measuring the "throughput" of the alerting / actions tasks by looking at the actions:execute and alerting:execute event documents - counts via a date histogram. This gives a rough number of how many alerts/actions are running per time unit. It seems to provide a pretty reasonable number, based on experiments adding/removing Kibanas on cloud.
- We should also figure out some stats to gauge the general "health" of ES and Kibana. CPU and memory usage would probably be a decent start, and adding some more ES stats later would be good.
- In the end, it would be nice to have a report comparing how these combinations change these metrics.
- I've been using the index threshold alert, and feeding the index it queries against with live data, to control whether actions will be running or not. Seems like a decent alert to test with.
- I've been using the server log action, which might actually have about the same latency as a "real" action (since most are HTTP calls to other cloud services), but perhaps working in a webhook call to some "interesting" but not spammy system would be more realistic.
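The throughput measurement described above can be sketched as an Elasticsearch search body. This is a minimal sketch only: the index pattern (.kibana-event-log-*) and the field names (event.provider, event.action, @timestamp) are assumptions based on Kibana's event log and should be verified against the deployment.

```javascript
// Build a search body that counts alerting/actions "execute" event
// documents per minute, which approximates executions-per-time-unit
// throughput. Field names and index pattern are assumptions; verify
// against your Kibana event log mapping.
function buildExecuteThroughputQuery(provider) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': provider } }, // 'alerting' or 'actions'
          { term: { 'event.action': 'execute' } },
          { range: { '@timestamp': { gte: 'now-1h' } } },
        ],
      },
    },
    aggs: {
      executions_per_minute: {
        date_histogram: { field: '@timestamp', fixed_interval: '1m' },
      },
    },
  };
}

// Example: POST this body to .kibana-event-log-*/_search
const query = buildExecuteThroughputQuery('alerting');
console.log(JSON.stringify(query, null, 2));
```

Each bucket count in executions_per_minute is then the per-minute execution rate for that provider.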
I'm closing this issue now that we have the kbn-alert-load tool built to measure performance benchmarks. There are two follow-up issues created that will be prioritized separately:
Alerting system load is affected by a number of factors:
These can create a variety of load patterns. Under the hood, both alert checks and actions are handled with Task Manager which is backed by Elasticsearch, each of which will have throughput limits. As the system evolves we need a way to reproduce different types and sizes of load and observe the performance characteristics in different environments.
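A rough load model helps reason about those throughput limits. The sketch below estimates demanded task executions per second against a single Kibana's Task Manager claim rate; the defaults used (10 workers, 3s poll interval) are assumptions based on Task Manager's documented defaults and should be checked against the actual configuration.

```javascript
// Back-of-the-envelope model of alerting load vs. Task Manager capacity.
// Assumed defaults: max_workers = 10, poll_interval = 3000 ms (verify
// against xpack.task_manager settings in your deployment).
function requiredTasksPerSecond(numAlerts, checkIntervalSec, actionsPerCheck = 0) {
  const checks = numAlerts / checkIntervalSec;  // alert executions/sec
  const actions = checks * actionsPerCheck;     // action executions/sec
  return checks + actions;
}

function taskManagerCapacityPerSecond(maxWorkers = 10, pollIntervalMs = 3000) {
  // Each poll cycle can claim at most maxWorkers tasks, so this is an
  // upper bound on sustained throughput for one Kibana instance.
  return maxWorkers / (pollIntervalMs / 1000);
}

// Example: 1000 alerts on a 1-minute interval, one action per check.
const demand = requiredTasksPerSecond(1000, 60, 1);
const capacity = taskManagerCapacityPerSecond();
console.log(demand, capacity, Math.ceil(demand / capacity)); // Kibanas needed
```

When demand exceeds capacity times the number of Kibana instances, tasks queue up and alert checks start running late, which is exactly the kind of behavior a load tool needs to surface.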
The objective of this issue is to build out such a tool. There are already command line utilities, like @pmuellr's repositories for kbn-actions and kbn-alerts, as well as alerting samples, that we could build upon to make this easy to set up, run, and tear down. Ideally we'd have some way to control the variables above, and a generic alert type that could take one or more Elasticsearch queries (in SQL or ES DSL) to control the load of the alert check.
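To make the generic alert type idea concrete, here is a hypothetical shape for its params. Every name here (esQueries, language, query) is illustrative only, not an existing Kibana API; the point is that each configured query adds a controllable unit of work to every alert check.

```javascript
// Hypothetical params validator for the proposed load-testing alert type.
// All field names are invented for illustration; this is not a Kibana API.
function validateLoadAlertParams(params) {
  if (!Array.isArray(params.esQueries) || params.esQueries.length === 0) {
    throw new Error('esQueries must be a non-empty array');
  }
  for (const q of params.esQueries) {
    if (!['sql', 'dsl'].includes(q.language)) {
      throw new Error(`unsupported query language: ${q.language}`);
    }
  }
  return params;
}

// Example: one SQL query and one ES DSL query run on every check,
// so doubling the entries roughly doubles the per-check query load.
const params = validateLoadAlertParams({
  esQueries: [
    { language: 'sql', query: 'SELECT count(*) FROM "test-index"' },
    { language: 'dsl', query: { match_all: {} } },
  ],
});
console.log(params.esQueries.length);
```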
Steps
To-Do:
To-Do kbn-alert-load:
- Support automatic deployment sizing conversions (Automatic deployment sizing conversion in kbn-alert-load tool #88388)
- Move the tool into Kibana (Move kbn-alert-load tool into Kibana alerting #88389)

Performance study:
Original description
Ran a stress test yesterday with an alert that always triggers an action. Created 1000 of them, interval 1s, action .server_log.
Never crashed or anything, but ES was steady at > 100% the entire time. Kibana steady at < 10%. No noticeable memory growth. Ran for ~12 hours.
Need to look into the ES perf ...
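For scale, the nominal demand of that stress test works out as follows. This is the requested execution rate implied by the configuration, not a measurement of what actually ran.

```javascript
// Nominal execution demand of the original stress test:
// 1000 alerts, each on a 1-second check interval.
const numAlerts = 1000;
const intervalSec = 1;
const perSecond = numAlerts / intervalSec; // checks requested per second
const perHour = perSecond * 3600;          // checks requested per hour
const twelveHours = perHour * 12;          // over the ~12h run
console.log(perSecond, perHour, twelveHours); // 1000 3600000 43200000
```

A requested rate of 1000 checks/sec for ~12 hours is tens of millions of executions, which is consistent with ES staying pinned for the duration of the run.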