
[cli][metrics] metric alarms configurable through the cli #302

Merged
merged 15 commits into master from ryandeivert-metric-alarms-cli
Sep 11, 2017

Conversation

@ryandeivert (Contributor) commented Sep 8, 2017

to @austinbyers or @chunyong-lin
cc @airbnb/streamalert-maintainers
size: large

Background & Why

  • Context on recent developments:
    • StreamAlert now has support for metric logging with the merge of #282 ([metrics] v2 of metrics support using metric filters).
    • The use of filter patterns for metrics allows us to cheaply track whatever metric we want by simply logging certain messages to the logger (see the sketch after this list).
    • Adding a new metric to be tracked involves adding it to the stream_alert/shared/metrics.py module so it can be used throughout the project.
  • Any predefined metric can now have alarms associated with it to allow for notifications if something unexpected is occurring.
  • For instance, if the number of FailedParses (aka incoming logs that do not match a defined schema) rises above a certain threshold, CloudWatch can fire an alarm that then notifies any service connected to the alarm.
    • The alarm currently sends to an SNS topic that is either designated by the user or chosen by default by StreamAlert.
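
For illustration, the metric-filter approach means the Lambda functions never need to call the CloudWatch API directly; they just write specially formatted log lines that CloudWatch Logs metric filters turn into data points. A minimal sketch of the idea follows - the helper name and log format here are assumptions for illustration, not the actual implementation in stream_alert/shared/metrics.py:

    import logging

    LOGGER = logging.getLogger('stream_alert')
    logging.basicConfig(level=logging.INFO)

    def log_metric(metric_name, value):
        """Write a log line in a fixed format that a CloudWatch metric filter can match."""
        LOGGER.info('metric: %s, value: %d', metric_name, value)

    # Example: record one failed parse so a FailedParses metric filter picks it up
    log_metric('FailedParses', 1)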

Changes

  • CloudWatch alarms for all predefined metrics we use are now configurable through the manage.py cli.
  • An alarm that operates against a specific cluster's metric is configurable like so (note the use of --metric-target cluster):
    python manage.py create-alarm \
      --metric FailedParses \
      --metric-target cluster \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-name 'Prod - Failed Parses Alarm' \
      --evaluation-periods 1 \
      --period 300 \
      --threshold 1.0 \
      --alarm-description 'Alarm for any failed parses that occur within a 5 minute period in the prod cluster' \
      --clusters prod \
      --statistic Sum
  • An alarm that operates against the aggregate metric (across all clusters) is configurable like so (note the use of --metric-target aggregate):
    python manage.py create-alarm \
      --metric FailedParses \
      --metric-target aggregate \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-name 'Aggregate - Failed Parses Alarm' \
      --evaluation-periods 1 \
      --period 300 \
      --threshold 1.0 \
      --alarm-description 'Aggregate alarm for any failed parses that occur within a 5 minute period in any cluster' \
      --statistic Sum
  • Alarms currently support a single action: sending to the SNS topic defined in conf/global.json (or the default stream_alert_monitoring SNS topic if one is not defined).

Other changes

  • Adding a metric for TOTAL_PROCESSED_SIZE that will log the processed bytes for each rule processor invocation.
  • The manage.py cli also now supports turning on or off metrics for a given cluster/function:
    manage.py metrics --enable --functions rule
    • An optional list of --clusters can be added to enable metrics only for specific clusters. Without --clusters, metrics are enabled for all clusters for the given function (see the example below).
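    • For example, enabling rule processor metrics for just the prod cluster (assuming --clusters accepts cluster names the same way the create-alarm command does):
      manage.py metrics --enable --functions rule --clusters prod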

Testing

  • There are no big changes to the core library/functions, but these changes have been deployed in a test AWS account to ensure the alarms get properly created:
+ aws_cloudwatch_metric_alarm.metric_alarm_AggregateFailedParsesAlarm
    actions_enabled:                       "true"
    alarm_actions.#:                       "1"
    alarm_actions.1304590987:              "arn:aws:sns:us-east-1:454267907943:stream_alert_monitoring"
    alarm_description:                     "Aggregate alarm for any failed parses that occur within a 5 minute period in any cluster"
    alarm_name:                            "Aggregate - Failed Parses Alarm"
    comparison_operator:                   "GreaterThanOrEqualToThreshold"
    evaluate_low_sample_count_percentiles: "<computed>"
    evaluation_periods:                    "1"
    metric_name:                           "RuleProcessor-FailedParses"
    namespace:                             "StreamAlert"
    period:                                "300"
    statistic:                             "Sum"
    threshold:                             "1"
    treat_missing_data:                    "missing"

+ module.stream_alert_prod.aws_cloudwatch_metric_alarm.cw_metric_alarms
    actions_enabled:                       "true"
    alarm_actions.#:                       "1"
    alarm_actions.1304590987:              "arn:aws:sns:us-east-1:454267907943:stream_alert_monitoring"
    alarm_description:                     "Alarm for any failed parses that occur within a 5 minute period in the prod cluster"
    alarm_name:                            "Prod - Failed Parses Alarm"
    comparison_operator:                   "GreaterThanOrEqualToThreshold"
    evaluate_low_sample_count_percentiles: "<computed>"
    evaluation_periods:                    "1"
    metric_name:                           "RuleProcessor-FailedParses-PROD"
    namespace:                             "StreamAlert"
    period:                                "300"
    statistic:                             "Sum"
    threshold:                             "1"
    treat_missing_data:                    "missing"

@ryandeivert ryandeivert force-pushed the ryandeivert-metric-alarms-cli branch 2 times, most recently from 22bcdaa to 99eb49c Compare September 8, 2017 18:57
@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.295% when pulling 99eb49c on ryandeivert-metric-alarms-cli into 8f07593 on master.

@airbnb airbnb deleted a comment from coveralls Sep 8, 2017
@@ -29,6 +29,7 @@
},
"rule_processor": {
"current_version": "$LATEST",
"enable_metrics": true,
Contributor

In the docs (docs/source/athena-deploy.rst), it says the default value is false. So which one is outdated? 😄

Contributor Author

Docs definitely - I do not have time to add those to this PR

Contributor Author

Also that's related to the athena processor not the rule processor or alert processor

@austinbyers (Contributor) left a comment

A few comments, but this looks good! The extended validation of user input will make for a much better user experience than deploying and waiting for Terraform to fail.

"""Subclass of argparse.Action to avoid multiple of the same choice from a list"""
def __call__(self, parser, namespace, values, option_string=None):
unique_items = set(values)
setattr(namespace, self.dest, unique_items)
Contributor

Cool! Some advanced argparsing I haven't seen before
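
For reference, a self-contained sketch of how an action like this plugs into a parser - the class name and option below are made up for illustration, not the PR's actual identifiers:

    import argparse

    class UniqueSetAction(argparse.Action):
        """Store the provided values as a set so duplicate choices collapse to one."""
        def __call__(self, parser, namespace, values, option_string=None):
            setattr(namespace, self.dest, set(values))

    parser = argparse.ArgumentParser()
    parser.add_argument('--clusters', nargs='+', action=UniqueSetAction)
    args = parser.parse_args(['--clusters', 'prod', 'prod', 'corp'])
    print(args.clusters)  # e.g. {'prod', 'corp'} (set order may vary)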

GreaterThanOrEqualToThreshold
GreaterThanThreshold
LessThanThreshold
LessThanOrEqualToThreshold
Contributor

It would be more user-friendly if you allow >=, >, <, <= and convert to the longer form, like you do for the function names

Contributor Author

That's smart - I was just using aws terms as a template but I like that approach better

Contributor Author

On second thought I think I'll leave it as is - using $ python manage.py create-alarm --help will show the acceptable options.

Also if an invalid param is used, it will print the options like: manage.py [command] [subcommand] [options] create-alarm: error: argument -co/--comparison-operator: invalid choice: 'GreaterThanOrEqualToThresho' (choose from 'GreaterThanOrEqualToThreshold', 'GreaterThanThreshold', 'LessThanThreshold', 'LessThanOrEqualToThreshold')

A good thought but I think I'd just like to mirror the AWS options :)
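
For reference, the rejected suggestion would have amounted to a small lookup that expands shorthand operators into the AWS names before they are passed along (illustrative only, not code from this PR):

    # Map shorthand comparison operators to the CloudWatch comparison operator names
    OPERATOR_SHORTHAND = {
        '>=': 'GreaterThanOrEqualToThreshold',
        '>': 'GreaterThanThreshold',
        '<': 'LessThanThreshold',
        '<=': 'LessThanOrEqualToThreshold',
    }

    def expand_operator(value):
        """Return the full CloudWatch name, accepting either the shorthand or the full form."""
        return OPERATOR_SHORTHAND.get(value, value)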

manage.py (outdated)
)

# Set the name of this parser to 'validate-schemas'
metric_alarm_parser.set_defaults(command='create-alarm')
Contributor

Comment does not match the code

// CloudWatch metric alarms that are created per-cluster
// The split list is our way around poor tf support for lists of maps and is made up of:
// <alarm_name>, <alarm_description>, <comparison_operator>, <evaluation_periods>,
// <metric>, <period>, <statistic>, <threshold>
Contributor

This seems like a pretty good workaround. Maybe add a TODO so we can come back and update this once Terraform has better support for complex types
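
For reference, the workaround boils down to flattening each alarm's settings into one delimited string that Terraform can pull apart again with split() and element() - the delimiter and helper below are assumptions for illustration, not the PR's actual code:

    # Serialize one alarm's settings in the documented field order so Terraform
    # can recover the individual values on the other side
    def serialize_alarm(name, description, operator, eval_periods, metric, period, statistic, threshold):
        fields = [name, description, operator, str(eval_periods), metric,
                  str(period), statistic, str(threshold)]
        return ','.join(fields)

    print(serialize_alarm('Prod - Failed Parses Alarm',
                          'Alarm for failed parses in the prod cluster',
                          'GreaterThanOrEqualToThreshold',
                          1, 'FailedParses', 300, 'Sum', 1.0))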

* Fixing bug in help string for validate-schemas command
* Adding some flags to the `manage.py metrics` command to accept a cluster and a function name
* By default, function name is required while cluster is not (will default to all clusters)
…ter files

* Metric alarms related to aggregate metrics get saved in the conf/global.json file
* Metric alarms related to the athena function get saved in the conf/global.json file
* Metric alarms related to the rule or alert processor functions get saved to each cluster
  file. This enables us to better organize metrics and allows the user to set different
  alarms for different clusters.
* Terraform is awful, and supporting both statistic and extended-statistic
  programmatically is much too difficult right now.
* Updating the property name for `metric` to be `metric_name` to align with what tf expects
… make it simpler

* The terraform generate code now completely supports writing CloudWatch alarms for both
  aggregate metric alarms and alarms per cluster/function
* Aggregate metric alarms are written to the `main.tf` file to be published.
* Per-cluster/function metric alarms are done via the `stream_alert` tf module.
* Updating unit test
* Migrating a constant dict to the metrics package to be shared
* Various linting
@ryandeivert ryandeivert force-pushed the ryandeivert-metric-alarms-cli branch from d8ae5f4 to e427d16 Compare September 8, 2017 23:44
@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.295% when pulling e427d16 on ryandeivert-metric-alarms-cli into d54ad1a on master.

@ryandeivert ryandeivert added this to the 1.5.0 milestone Sep 10, 2017
@ryandeivert ryandeivert merged commit d3ac7f3 into master Sep 11, 2017
@ryandeivert ryandeivert deleted the ryandeivert-metric-alarms-cli branch September 11, 2017 18:48