
[Monitoring] Out of the box alerting #68805

Merged 85 commits into elastic:master on Jul 14, 2020

Conversation

@chrisronline (Contributor) commented Jun 10, 2020

Refactors #62793
Refactors #61685
Relates to #42960

This PR introduces quite a few things to the Stack Monitoring UI:

Creation

We want these alerts to be created and enabled by default, without the user needing to know or do anything, but that is not yet possible. As a temporary solution, we attempt to create these alerts every time the Stack Monitoring UI is loaded. If the alerts already exist, nothing happens (duplicate alerts are not created).

Visibility

Firing scenario

The alerts will appear in the UI when triggered, such as:

Screen Shot 2020-06-30 at 2 08 34 PM

Clicking into the alert opens a list view of all "firing" alerts along with the timestamp of when each alert was triggered (this data is stored by the alerting framework):

Screen Shot 2020-06-30 at 2 13 36 PM

Clicking into a single alert will give the user some useful information about the details of the alert as well as potential resolution steps:

Screen Shot 2020-06-30 at 2 25 53 PM

Clicking View alert configuration will present the familiar flyout:

Screen Shot 2020-06-30 at 2 27 35 PM

Non-firing scenario

These screenshots showcase the visibility of firing alerts, but users can also gain visibility into these alerts through setup mode.

Screen Shot 2020-06-30 at 2 30 43 PM

The UX is the same when clicking on alerts in this context, except there is no detailed information about the alert since it is not firing; however, the View alert configuration button is still available, allowing the user to change properties of the alert.

Screen Shot 2020-06-30 at 2 33 15 PM

Screen Shot 2020-06-30 at 2 37 18 PM

Legacy watcher-based alerts

Unfortunately, we are not able to permanently disable existing watcher-based cluster alerts until elastic/elasticsearch#50032 is resolved.

In the meantime, we will allow them to co-exist with our new Kibana alerts. The alerts themselves will exist as Kibana alerts, but will require the watch history to indicate a firing scenario before the Kibana alert itself fires. We are doing this because we can't stop the watches from doing what they do now, which is to index into .monitoring-alerts-* and send an email (if configured), and we don't want to step on that existing behavior.

The user will not notice the difference, and once we can fully disable the watcher-based cluster alerts, we can convert these to full-fledged Kibana alerts behind the scenes without the user needing to know.

Testing

To enable these new alerts, all you need to do is pull down this PR, start Kibana, and navigate to the Stack Monitoring UI. This creates all of the alerts, but there is a bit more you'll need to do to actually test most of them.

The CPU usage alert will just work, but you'll need to enable Watcher for the rest to work, which means you'll need to be on a trial license (or Gold+). After doing that, you can verify the watches exist by going to Stack Management -> Watches (but keep in mind this bug will require you to enable legacy monitoring for the watches to exist).
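
If your local cluster doesn't already have the right license, one quick way to get there (assuming a fresh dev cluster with no existing trial) is to start a trial via the license API from Dev Tools:

```
# start a 30-day trial license
POST _license/start_trial?acknowledge=true

# confirm the trial is active
GET _license
```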

The harder part is actually getting your cluster in a state to trigger the various alerts.

CPU Usage

To simulate this, edit the threshold by enabling setup mode on the cluster overview page, clicking the Alerts badge on the Nodes panel, and editing the CPU usage alert configuration to some low value you can easily reach on your machine.

License expiration

For this, I've been adding an ingest pipeline to simulate an early expiration.

See https://gist.github.com/chrisronline/9d4d3d740e535d3c01410cac2cc74653
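
The gist has the exact pipeline; the general shape of this kind of simulation (a sketch only, assuming legacy collection writes cluster_stats documents with a license.expiry_date_in_millis field into .monitoring-es-*; the pipeline name and timestamp below are placeholders you'd adjust) looks like:

```
# TEST ONLY: make the license look like it expires soon
PUT _ingest/pipeline/simulate_license_expiration
{
  "description": "Rewrite license.expiry_date_in_millis on cluster_stats docs",
  "processors": [
    {
      "set": {
        "if": "ctx.type == 'cluster_stats' && ctx.license != null",
        "field": "license.expiry_date_in_millis",
        "value": 1594857600000
      }
    }
  ]
}

# apply it to incoming monitoring documents
PUT .monitoring-es-*/_settings
{
  "index.default_pipeline": "simulate_license_expiration"
}
```

Remember to clear the index.default_pipeline setting (set it to null) when you're done testing.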

Cluster status

For this, simply create an index and add a document. This should trigger the alert indicating you need to allocate replica shards.
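
For example, on a single-node dev cluster the default replica of a new index can never be allocated, so cluster health turns yellow as soon as the index exists (the index name here is arbitrary):

```
# create an index with one replica that cannot be allocated on a single node
PUT yellow-health-test
{
  "settings": { "number_of_replicas": 1 }
}

# add a document so the index has data
POST yellow-health-test/_doc
{ "message": "trigger a yellow cluster status" }
```

Deleting the index afterwards (DELETE yellow-health-test) should resolve the alert.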

ES nodes change

For this, I found it easier to simulate with an ingest pipeline.

See https://gist.github.com/chrisronline/d441aba1a08cb45082e59f39cc9f6687
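
Again, the gist has the real thing; the rough idea (a sketch only, assuming the cluster_stats documents carry the node list under cluster_state.nodes; the node id and name here are made up) is to inject a fake node so the derived node list changes:

```
# TEST ONLY: add a fake node to the cluster state seen by monitoring
PUT _ingest/pipeline/simulate_nodes_changed
{
  "description": "Inject a fake node into cluster_state.nodes on cluster_stats docs",
  "processors": [
    {
      "script": {
        "if": "ctx.type == 'cluster_stats' && ctx.cluster_state != null && ctx.cluster_state.nodes != null",
        "source": "ctx.cluster_state.nodes['fake-node-uuid'] = ['name': 'fake-node']"
      }
    }
  ]
}

PUT .monitoring-es-*/_settings
{
  "index.default_pipeline": "simulate_nodes_changed"
}
```

Note that an index only honors one default_pipeline at a time, so clear the pipeline from any previous test first.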

Elasticsearch version mismatch

This is not testable, as the watch itself is broken. See elastic/elasticsearch#58261

Kibana version mismatch

Again, I found this easier to simulate with an ingest pipeline.

See https://gist.github.com/chrisronline/34328d14738f0ce754e36ec7031e45a9
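
Same pattern, but against the Kibana monitoring documents (a sketch only, assuming kibana_stats documents in .monitoring-kibana-* expose the version at kibana_stats.kibana.version; the fake version is arbitrary). New documents report the fake version while older documents in the lookback window still carry the real one, so two distinct versions show up:

```
# TEST ONLY: report a fake Kibana version
PUT _ingest/pipeline/simulate_kibana_version_mismatch
{
  "description": "Overwrite kibana_stats.kibana.version on kibana_stats docs",
  "processors": [
    {
      "set": {
        "if": "ctx.type == 'kibana_stats' && ctx.kibana_stats != null",
        "field": "kibana_stats.kibana.version",
        "value": "7.7.0"
      }
    }
  ]
}

PUT .monitoring-kibana-*/_settings
{
  "index.default_pipeline": "simulate_kibana_version_mismatch"
}
```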

Logstash version mismatch

Again, I found this easier to simulate with an ingest pipeline.

See https://gist.github.com/chrisronline/3b982d95710ef820d11c7443a1e49091
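
And the Logstash flavor of the same trick (a sketch only, assuming logstash_stats documents in .monitoring-logstash-* expose logstash_stats.logstash.version), applied to .monitoring-logstash-* via index.default_pipeline as above:

```
# TEST ONLY: report a fake Logstash version
PUT _ingest/pipeline/simulate_logstash_version_mismatch
{
  "description": "Overwrite logstash_stats.logstash.version on logstash_stats docs",
  "processors": [
    {
      "set": {
        "if": "ctx.type == 'logstash_stats' && ctx.logstash_stats != null",
        "field": "logstash_stats.logstash.version",
        "value": "7.7.0"
      }
    }
  ]
}
```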

TODO

  • Fix/add tests
  • Localize all
  • Figure out how to message, for legacy alerts, that email might already be configured (they should know though, right?)
  • Add alert data to each relevant status bar
  • Write up cloud instructions
  • Test CPU usage in containerized environment
  • Add mute/disable options in the panel
  • Fix issue with not using existing state within the alert instance
  • Ensure each email default message has a kibana link back
  • How can we ensure we can make updates to the email/server log messaging?

@chrisronline (Author) commented:

cc @gchaps for help with the copy here

Let me know if it's easier if I just prepare the copy in a doc or something

@gchaps (Contributor) commented Jul 13, 2020

@chrisronline Are all the screenshots included in the summary? If so, I can go by that. Might be best to set up a meeting and go through the copy together.

@chrisronline (Author) commented Jul 14, 2020

@gchaps

Here is the copy we need help with. This might be a bit confusing, so happy to clarify if necessary.

Cluster health alert

Params

{state} - The current state of the alert.
{clusterHealth} - The health of the cluster
{clusterName} - The name of the cluster to which the nodes belong.
{action} - The recommended action to take based on this alert firing

EDIT:
{clusterHealth}: The health of the cluster. (add ending period)
{clusterName}: The cluster to which the nodes belong.
{action}: The recommended action for this alert.

Possible param values

{state}

  • firing
  • resolved

{action}

  • Allocate missing primary and replica shards.

UI messaging

When firing

Elasticsearch cluster health is {clusterHealth}.

When resolved

Elasticsearch cluster health is green.

Email

Subject: Cluster health alert is {state} for {clusterName}. Current health is {clusterHealth}.
Body: {action}

EDIT:
Cluster health is {state} for {clusterName}. Current health is {clusterHealth}.

Server log

Cluster health alert is {state} for {clusterName}. Current health is {clusterHealth}. {action}


CPU usage alert

Params

{state} - The current state of the alert.
{nodes} - The list of nodes that are reporting high cpu usage.
{count} - The number of nodes that are reporting high cpu usage.
{clusterName} - The name of the cluster to which the nodes belong.
{action} - The recommended action to take based on this alert firing

EDIT:

  • {nodes}: The list of nodes reporting high CPU usage.
  • {count}: The number of nodes reporting high CPU usage.
  • {clusterName}: The cluster to which the nodes belong.
  • {action}: The recommended action for this alert. (Be sure to add an ending period.)

Possible param values

{state}

  • firing
  • resolved

{action}

  • Investigate (links into Kibana)

UI messaging

When firing

The cpu usage on node {nodeName} is now under the threshold, currently reporting at {cpuUsage}% as of #resolved

EDIT
The CPU usage on node {nodeName} is under the threshold. Usage is {cpuUsage}% as of #resolved.

When resolved

Elasticsearch cluster health is green.

Email

Subject: CPU usage alert is {state} for {count} node(s) in cluster: {clusterName}
Body: {action}

EDIT
Subject: CPU usage is {state} for {count} node(s) in cluster {clusterName} (Can we remove the word alert? Also colon before clusterName.)

Server log

CPU usage alert is {state} for {count} node(s) in cluster: {clusterName}


Elasticsearch version mismatch alert

Params

{state} - The current state of the alert.
{clusterName} - The name of the cluster to which the nodes belong.
{versionList} - The list of unique versions.

EDIT
{clusterName} - The cluster to which the nodes belong.
{versionList} - The versions of Elasticsearch running in this cluster. (may not be right, but should define what versions are being talked about).

Possible param values

{state}

  • firing
  • resolved

UI messaging

When firing

There are different versions of Elasticsearch ({versions}) running in this cluster.

EDIT
Multiple versions of Elasticsearch ({versions}) are running in this cluster.

When resolved

All versions are the same for Elasticsearch in this cluster.

EDIT
All versions of Elasticsearch are the same in this cluster.

Email

Subject: Elasticsearch version mismatch alert is {state} for {clusterName}
Body: Elasticsearch is running {versionList} in {clusterName}.

EDIT
Subject: Multiple versions of Elasticsearch running in {clusterName}. (If not ok to remove alert state, leave as is)

Server log

Elasticsearch version mismatch alert is {state} for {clusterName}


Kibana version mismatch alert

Params

{state} - The current state of the alert.
{clusterName} - The name of the cluster to which the instances belong.
{versionList} - The list of unique versions.

EDIT
{clusterName} - The cluster to which the Kibana instances belong. (Instances is vague by itself. Is Kibana right here?)
{versionList} - The versions of Kibana running in this cluster.

Possible param values

{state}

  • firing
  • resolved

UI messaging

When firing

There are different versions of Kibana ({versions}) running in this cluster.

EDIT
Multiple versions of Kibana ({versions}) are running in this cluster.

When resolved

All versions are the same for Kibana in this cluster.

EDIT
All versions of Kibana are the same for this cluster.

Email

Subject: Kibana version mismatch alert is {state} for {clusterName}
Body: Kibana running {versionList} in {clusterName}.

EDIT
Subject: Multiple versions of Kibana running in {clusterName}

Server log

Kibana version mismatch alert is {state} for {clusterName}


Logstash version mismatch alert

Params

{state} - The current state of the alert.
{clusterName} - The name of the cluster to which the nodes belong.
{versionList} - The list of unique versions.

EDIT
{clusterName} - The cluster to which the nodes belong.
{versionList} - The versions of Logstash running in this cluster.

Possible param values

{state}

  • firing
  • resolved

UI messaging

When firing

There are different versions of Logstash ({versions}) running in this cluster.

EDIT
Multiple versions of Logstash ({versions}) are running in this cluster.

When resolved

All versions are the same for Logstash in this cluster.

EDIT
All versions of Logstash are the same for this cluster.

Email

Subject: Logstash version mismatch alert is {state} for {clusterName}
Body: Logstash running {versionList} in {clusterName}.

EDIT
Subject: Multiple versions of Logstash running in {clusterName}

Server log

Logstash version mismatch alert is {state} for {clusterName}


License expiration alert

Params

{state} - The current state of the alert.
{expiredDate} - The date when the license expires.
{action} - The recommended action to take based on this alert firing.
{clusterName} - The name of the cluster to which the nodes belong.

EDIT
{action} - The recommended action for this alert.
{clusterName} - The cluster to which the nodes belong.

Possible param values

{state}

  • firing
  • resolved

UI messaging

When firing

This cluster's license is going to expire in #relative at #absolute. #start_linkPlease update your license.#end_link

EDIT
The license for this cluster expires in #relative at #absolute. #start_linkPlease update your license.#end_link

When resolved

This cluster's license is active.

EDIT
The license for this cluster is active.

Email

Body: Your license will expire in {expiredDate} for {clusterName}. {action}

EDIT
Body: Your license for {clusterName} expires in {expiredDate}. {action}

Server log

License expiration alert is {state} for {clusterName}. Your license will expire in {expiredDate}. {action}

EDIT
License expiration alert is {state} for {clusterName}. Your license expires in {expiredDate}. {action}


ES nodes changed alert

Params

{state} - The current state of the alert.
{clusterName} - The name of the cluster to which the nodes belong.
{added} - The list of nodes added to the cluster.
{removed} - The list of nodes removed from the cluster.
{restarted} - The list of nodes restarted in the cluster.

EDIT
{clusterName} - The cluster to which the nodes belong.
{added} - The nodes added to the cluster.
{removed} - The nodes removed from the cluster.
{restarted} - The nodes restarted in the cluster.

Possible param values

{state}

  • firing
  • resolved

UI messaging

When firing

Note: These will appear side by side if more than one applies.
Elasticsearch nodes '{added}' added to this cluster.
Elasticsearch nodes '{removed}' removed from this cluster.
Elasticsearch nodes '{restarted}' restarted in this cluster.

When resolved

No changes detected in Elasticsearch nodes for this cluster.

EDIT
No changes in Elasticsearch nodes for this cluster.

Email

Subject: Elasticsearch nodes changed alert is {state} for {clusterName}
Body: The following Elasticsearch nodes in {clusterName} have been added:{added} removed:{removed} restarted:{restarted}

EDIT

Subject: Elasticsearch nodes changed for {clusterName}

Can you format the body as follows:

The following Elasticsearch nodes changed in {clusterName}:

  • Added:{added}
  • Removed:{removed}
  • Restarted:{restarted}

Server log

Elasticsearch nodes changed alert is {state} for {clusterName}

@igoristic (Contributor) left a comment

The changes look good 👍

@kibanamachine commented:

💚 Build Succeeded

Build metrics

@kbn/optimizer bundle module count

id          value  diff  baseline
monitoring  217    +15   202

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

chrisronline merged commit 06b1820 into elastic:master on Jul 14, 2020
chrisronline deleted the monitoring/guardrails branch on July 14, 2020 22:00
chrisronline added a commit that referenced this pull request Jul 15, 2020

* First draft, not quite working but a good start

* More working

* Support configuring throttle

* Get the other alerts working too

* More

* Separate into individual files

* Menu support as well as better integration in existing UIs

* Red borders!

* New overview style, and renamed alert

* more visual updates

* Update cpu usage and improve settings configuration in UI

* Convert cluster health and license expiration alert to use legacy data model

* Remove most of the custom UI and use the flyout

* Add the actual alerts

* Remove more code

* Fix formatting

* Fix up some errors

* Remove unnecessary code

* Updates

* add more links here

* Fix up linkage

* Added nodes changed alert

* Most of the version mismatch working

* Add kibana mismatch

* UI tweaks

* Add timestamp

* Support actions in the enable api

* Move this around

* Better support for changing legacy alerts

* Add missing files

* Update alerts

* Enable alerts whenever any page is visited in SM

* Tweaks

* Use more practical default

* Remove the buggy renderer and ensure setup mode can show all alerts

* Updates

* Remove unnecessary code

* Remove some dead code

* Cleanup

* Fix snapshot

* Fixes

* Fixes

* Fix test

* Add alerts to kibana and logstash listing pages

* Fix test

* Add disable/mute options

* Tweaks

* Fix linting

* Fix i18n

* Adding a couple tests

* Fix localization

* Use http

* Ensure we properly handle when an alert is resolved

* Fix tests

* Hide legacy alerts if not the right license

* Design tweaks

* Fix tests

* PR feedback

* Moar tests

* Fix i18n

* Ensure we have a control over the messaging

* Fix translations

* Tweaks

* More localization

* Copy changes

* Type
# Conflicts:
#	x-pack/plugins/monitoring/common/constants.ts
#	x-pack/plugins/monitoring/public/components/cluster/overview/alerts_panel.js
#	x-pack/plugins/monitoring/public/components/cluster/overview/index.js
#	x-pack/plugins/monitoring/public/components/elasticsearch/node/node.js
#	x-pack/plugins/monitoring/public/components/elasticsearch/nodes/nodes.js
#	x-pack/plugins/monitoring/public/components/kibana/instances/instances.js
#	x-pack/plugins/monitoring/server/plugin.ts
#	x-pack/test/functional/apps/monitoring/cluster/alerts.js
@chrisronline (Author) commented:

Backport

7.x: 510a684
