
[cli][metrics] metric alarms configurable through the cli #302

Merged
merged 15 commits into master from ryandeivert-metric-alarms-cli
Sep 11, 2017

Conversation

@ryandeivert (Contributor) commented Sep 8, 2017

to @austinbyers or @chunyong-lin
cc @airbnb/streamalert-maintainers
size: large

Background & Why

  • Context on recent developments:
    • StreamAlert now has support for metric logging with the merge of #282 ([metrics] v2 of metrics support using metric filters).
    • The use of filter patterns for metrics allows us to cheaply track whatever metric we want by simply logging certain messages to the logger (see the sketch after this list).
    • Adding a new metric to be tracked involves adding it to the stream_alert/shared/metrics.py module so it can be used throughout the project.
  • Any predefined metric can now have alarms associated with it to allow for notifications if something unexpected is occurring.
  • For instance, if the number of FailedParses (aka incoming logs that do not match a defined schema) rises above a certain threshold, CloudWatch can fire an alarm that then notifies any service connected to the alarm.
    • The alarm currently sends to an SNS topic that is either designated by the user or chosen by default by StreamAlert.
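
For illustration, the metric-filter approach means the Lambda functions never need to call the CloudWatch API directly; they just write specially formatted log lines that CloudWatch Logs metric filters turn into data points. A minimal sketch of the idea follows - the helper name and log format here are assumptions for illustration, not the actual implementation in stream_alert/shared/metrics.py:

    import logging

    LOGGER = logging.getLogger('stream_alert')
    logging.basicConfig(level=logging.INFO)

    def log_metric(metric_name, value):
        """Write a log line in a fixed format that a CloudWatch metric filter can match."""
        LOGGER.info('metric: %s, value: %d', metric_name, value)

    # Example: record one failed parse so a FailedParses metric filter picks it up
    log_metric('FailedParses', 1)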

Changes

  • CloudWatch alarms for all predefined metrics we use are now configurable through the manage.py cli.
  • An alarm that operates against a specific cluster's metric is configurable like so (note the use of --metric-target cluster):
    python manage.py create-alarm \
      --metric FailedParses \
      --metric-target cluster \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-name 'Prod - Failed Parses Alarm' \
      --evaluation-periods 1 \
      --period 300 \
      --threshold 1.0 \
      --alarm-description 'Alarm for any failed parses that occur within a 5 minute period in the prod cluster' \
      --clusters prod \
      --statistic Sum
  • An alarm that operates against the aggregate metric (across all clusters) is configurable like so (note the use of --metric-target aggregate):
    python manage.py create-alarm \
      --metric FailedParses \
      --metric-target aggregate \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-name 'Aggregate - Failed Parses Alarm' \
      --evaluation-periods 1 \
      --period 300 \
      --threshold 1.0 \
      --alarm-description 'Aggregate alarm for any failed parses that occur within a 5 minute period in any cluster' \
      --statistic Sum
  • Alarms currently support a single action: sending to the SNS topic defined in conf/global.json (or the default stream_alert_monitoring SNS topic if one is not defined).

Other changes

  • Adding a metric for TOTAL_PROCESSED_SIZE that will log the processed bytes for each rule processor invocation.
  • The manage.py cli also now supports turning on or off metrics for a given cluster/function:
    manage.py metrics --enable --functions rule
    • An optional list of --clusters can be added to enable metrics only for specific clusters. Without --clusters, metrics are enabled for all clusters for the given function (see the example below).
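    • For example, enabling rule processor metrics for just the prod cluster (assuming --clusters accepts cluster names the same way the create-alarm command does):
      manage.py metrics --enable --functions rule --clusters prod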

Testing

  • There are no big changes to the core library/functions, but these changes have been deployed in a test AWS account to ensure the alarms get properly created:
+ aws_cloudwatch_metric_alarm.metric_alarm_AggregateFailedParsesAlarm
    actions_enabled:                       "true"
    alarm_actions.#:                       "1"
    alarm_actions.1304590987:              "arn:aws:sns:us-east-1:454267907943:stream_alert_monitoring"
    alarm_description:                     "Aggregate alarm for any failed parses that occur within a 5 minute period in any cluster"
    alarm_name:                            "Aggregate - Failed Parses Alarm"
    comparison_operator:                   "GreaterThanOrEqualToThreshold"
    evaluate_low_sample_count_percentiles: "<computed>"
    evaluation_periods:                    "1"
    metric_name:                           "RuleProcessor-FailedParses"
    namespace:                             "StreamAlert"
    period:                                "300"
    statistic:                             "Sum"
    threshold:                             "1"
    treat_missing_data:                    "missing"

+ module.stream_alert_prod.aws_cloudwatch_metric_alarm.cw_metric_alarms
    actions_enabled:                       "true"
    alarm_actions.#:                       "1"
    alarm_actions.1304590987:              "arn:aws:sns:us-east-1:454267907943:stream_alert_monitoring"
    alarm_description:                     "Alarm for any failed parses that occur within a 5 minute period in the prod cluster"
    alarm_name:                            "Prod - Failed Parses Alarm"
    comparison_operator:                   "GreaterThanOrEqualToThreshold"
    evaluate_low_sample_count_percentiles: "<computed>"
    evaluation_periods:                    "1"
    metric_name:                           "RuleProcessor-FailedParses-PROD"
    namespace:                             "StreamAlert"
    period:                                "300"
    statistic:                             "Sum"
    threshold:                             "1"
    treat_missing_data:                    "missing"

@ryandeivert ryandeivert force-pushed the ryandeivert-metric-alarms-cli branch 2 times, most recently from 22bcdaa to 99eb49c Compare September 8, 2017 18:57
@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.295% when pulling 99eb49c on ryandeivert-metric-alarms-cli into 8f07593 on master.

@airbnb airbnb deleted a comment from coveralls Sep 8, 2017
@@ -29,6 +29,7 @@
},
"rule_processor": {
"current_version": "$LATEST",
"enable_metrics": true,
Contributor

In the docs (docs/source/athena-deploy.rst), it says the default value is false. So which one is outdated? 😄

Contributor Author

Docs definitely - I do not have time to add those to this PR

Contributor Author

Also that's related to the athena processor not the rule processor or alert processor

@austinbyers (Contributor) left a comment

A few comments, but this looks good! The extended validation of user input will make for a much better user experience than deploying and waiting for Terraform to fail.

"""Subclass of argparse.Action to avoid multiple of the same choice from a list"""
def __call__(self, parser, namespace, values, option_string=None):
unique_items = set(values)
setattr(namespace, self.dest, unique_items)
Contributor

Cool! Some advanced argparsing I haven't seen before
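
For reference, a self-contained sketch of how an action like this plugs into a parser - the class name and option below are made up for illustration, not the PR's actual identifiers:

    import argparse

    class UniqueSetAction(argparse.Action):
        """Store the provided values as a set so duplicate choices collapse to one."""
        def __call__(self, parser, namespace, values, option_string=None):
            setattr(namespace, self.dest, set(values))

    parser = argparse.ArgumentParser()
    parser.add_argument('--clusters', nargs='+', action=UniqueSetAction)
    args = parser.parse_args(['--clusters', 'prod', 'prod', 'corp'])
    print(args.clusters)  # e.g. {'prod', 'corp'} (set order may vary)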

GreaterThanOrEqualToThreshold
GreaterThanThreshold
LessThanThreshold
LessThanOrEqualToThreshold
Contributor

It would be more user-friendly if you allow >=, >, <, <= and convert to the longer form, like you do for the function names

Contributor Author

That's smart - I was just using aws terms as a template but I like that approach better

Contributor Author

On second thought I think I'll leave it as is - using $ python manage.py create-alarm --help will show the acceptable options.

Also if an invalid param is used, it will print the options like: manage.py [command] [subcommand] [options] create-alarm: error: argument -co/--comparison-operator: invalid choice: 'GreaterThanOrEqualToThresho' (choose from 'GreaterThanOrEqualToThreshold', 'GreaterThanThreshold', 'LessThanThreshold', 'LessThanOrEqualToThreshold')

A good thought but I think I'd just like to mirror the AWS options :)
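
For reference, the rejected suggestion would have amounted to a small lookup that expands shorthand operators into the AWS names before they are passed along (illustrative only, not code from this PR):

    # Map shorthand comparison operators to the CloudWatch comparison operator names
    OPERATOR_SHORTHAND = {
        '>=': 'GreaterThanOrEqualToThreshold',
        '>': 'GreaterThanThreshold',
        '<': 'LessThanThreshold',
        '<=': 'LessThanOrEqualToThreshold',
    }

    def expand_operator(value):
        """Return the full CloudWatch name, accepting either the shorthand or the full form."""
        return OPERATOR_SHORTHAND.get(value, value)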

manage.py (outdated)
)

# Set the name of this parser to 'validate-schemas'
metric_alarm_parser.set_defaults(command='create-alarm')
Contributor

Comment does not match the code

// CloudWatch metric alarms that are created per-cluster
// The split list is our way around poor tf support for lists of maps and is made up of:
// <alarm_name>, <alarm_description>, <comparison_operator>, <evaluation_periods>,
// <metric>, <period>, <statistic>, <threshold>
Contributor

This seems like a pretty good workaround. Maybe add a TODO so we can come back and update this once Terraform has better support for complex types
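
For reference, the workaround boils down to flattening each alarm's settings into one delimited string that Terraform can pull apart again with split() and element() - the delimiter and helper below are assumptions for illustration, not the PR's actual code:

    # Serialize one alarm's settings in the documented field order so Terraform
    # can recover the individual values on the other side
    def serialize_alarm(name, description, operator, eval_periods, metric, period, statistic, threshold):
        fields = [name, description, operator, str(eval_periods), metric,
                  str(period), statistic, str(threshold)]
        return ','.join(fields)

    print(serialize_alarm('Prod - Failed Parses Alarm',
                          'Alarm for failed parses in the prod cluster',
                          'GreaterThanOrEqualToThreshold',
                          1, 'FailedParses', 300, 'Sum', 1.0))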

* Fixing bug in help string for validate-schemas command
* Adding some flags to the `manage.py metrics` command to accept a cluster and a function name
* By default, function name is required while cluster is not (will default to all clusters)
…ter files

* Metric alarms related to aggregate metrics get saved in the conf/global.json file
* Metric alarms related to the athena function get saved in the conf/global.json file
* Metric alarms related to the rule or alert processor functions get saved to each cluster
  file. This enables us to better organize metrics and allows the user to set different
  alarms for different clusters.
* Terraform is awful, and supporting both statistic and extended-statistic
  programmatically is much too difficult right now.
* Updating the property name for `metric` to be `metric_name` to align with what tf expects
… make it simpler

* The terraform generate code now completely supports writing CloudWatch alarms for both
  aggregate metric alarms and alarms per cluster/function
* Aggregate metric alarms are written to the `main.tf` file to be published.
* Per-cluster/function metric alarms are done via the `stream_alert` tf module.
* Updating unit test
* Migrating a constant dict to the metrics package to be shared
* Various linting
@ryandeivert ryandeivert force-pushed the ryandeivert-metric-alarms-cli branch from d8ae5f4 to e427d16 Compare September 8, 2017 23:44
@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.295% when pulling e427d16 on ryandeivert-metric-alarms-cli into d54ad1a on master.

@ryandeivert ryandeivert added this to the 1.5.0 milestone Sep 10, 2017
@ryandeivert ryandeivert merged commit d3ac7f3 into master Sep 11, 2017
@ryandeivert ryandeivert deleted the ryandeivert-metric-alarms-cli branch September 11, 2017 18:48