Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] drop sns publish for lambda invoke #225

Merged
merged 14 commits into from
Jul 18, 2017

Conversation

ryandeivert
Copy link
Contributor

@ryandeivert ryandeivert commented Jul 14, 2017

to @austinbyers
cc @jacknagz , @airbnb/dashed-maintainers
size: large
resolves: #189

Why

  • We have been noticing an increased number of SNS message failures in cloudwatch, which means potentially meaningful alerts are being lost in transit between the Rule Processor and the Alert Processor. We currently have no insight into why these failure occur.
  • Switching to lambda, with an asynchronous invocation, provides the benefit of automatic retry logic (see: here). The invocation will now retry twice if there was a failure.
  • This greatly simplifies how the Rule Processor and Alert Processor interact, removing a good amount of previously necessary logic.

Changes in collab branch with @jacknagz

  • Migrating to using direct lambda invocation of the alert processor function from the rule processor function.
  • This is an async invocation ('Event' type) and sends the payload directly to the alert processor lambda (no 'sns' formatting required).
  • The alert processor now receives the payload directly in the event parameter of the lambda handler.
  • Updating the alert processor to perform some input validation of the alert before processing occurs. This should avoid errors due to a malformed payload.
    • Specific errors indicating what is wrong with the input payload will be logged as errors, and the alert payload itself will be reported after all other errors..
  • Updating all unit tests and add new ones.
  • Updating integration tests to conform to other updates.
  • Thanks to @jacknagz for the necessary terraform, terraform generation, and doc updates. (please fill in details here as you see fit..).

Fixes

  • Resolving bug where importing pip as a module broke the CLI logging. The fix is in this commit and is described here.

To consider:

  • Setting up a DLQ (Dead Letter Queue) for our lambda functions that would give us insight into any lambda invocation failures in the future.
  • Checking the payload size being sent in the Alert Processor lambda invocation, and segmenting messages if it exceeds the limit (see: here).

@coveralls
Copy link

Coverage Status

Coverage increased (+3.5%) to 82.922% when pulling 3485350 on ryandeivert-drop-sns-publish-for-lambda-invoke into c50856c on master.

@ryandeivert ryandeivert added this to the 1.5.0 milestone Jul 14, 2017
Copy link
Contributor

@jacknagz jacknagz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple inline comments

data)
return

if response['ResponseMetadata']['HTTPStatusCode'] != 202:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this block slightly repetitive to the preceding one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific? What's repetitive about it (and are you referring to the try/except above?)?

If so, I think an HTTP response of non-202 could possibly occur even if there is no ClientError occurred during invocation. I could also be off on that.. We could also add FunctionError to this error logging.

continue

status_values.extend(run(loaded_sns_message, region, function_name, config))
# Yield back the current status to the caller
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment be a little more descriptive?

}
}
}

region [string]: the AWS region being used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's update this to: The AWS region of the currently executing Lambda function

Copy link
Contributor

@austinbyers austinbyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be awesome!

if not set(alert['metadata'].keys()) == metadata_keys:
LOGGER.error('The value of the \'metadata\' key must be a map (dict) '
'that contains the following keys: %s',
', '.join('\'{}\''.format(key) for key in metadata_keys))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these comparisons in this file test for set equality (the keys must match the hard-coded set), but the errors messages say it must contain such and such keys.

Just to mitigate potential confusion, can we either update the error messages to indicate it must contain exactly the required keys, or else update the tests to check for a set inclusion instead of equality?

Copy link
Contributor Author

@ryandeivert ryandeivert Jul 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@austinbyers making the logger messages indication that only those keys should be included. would rather have a strict validation of input instead of a loose validation.

LOGGER.error('The value of the \'%s\' key within \'%s\' must be '
'a string (str).', entry, key)
valid = False
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant (at the end of the inner for loop)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! removed.

LOGGER.error('The value of each entry in the \'outputs\' list '
'must be a string (str).')
valid = False
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! removed.

LOGGER.error('The value of the \'%s\' key must be a string (str), not %s',
key, type(alert['metadata'][key]))
valid = False
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically redundant, but if you want to keep this one to be consistent with all the other checks, that's fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! removed for brevity.

LOGGER.error('An error occurred while decoding message to JSON: %s', err)
continue
# Return the current list of statuses back to the caller
return [status for status in run(event, region, function_name, config)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would return list(run(event, region, function_name, config)) do the trick?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same effect yes. I'll update with explicit list casting.

resource "aws_sns_topic_subscription" "input_topic_subscriptions" {
count = "${length(var.input_sns_topics)}"
topic_arn = "${element(var.input_sns_topics, count.index)}"
endpoint = "${aws_lambda_function.streamalert_rule_processor.arn}:production"
protocol = "lambda"
}

// Lambda Permission: Allow SNS to invoke the Rule Processor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - how did we not have this before? SNS is currently alerting the rule processor just fine...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had it, it just moved into this file from here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

invalid_metadata_non_string['metadata']['type'] = 4.5

# Test with invalid metadata non-string value
assert_false(validate_alert(invalid_metadata_non_string))
Copy link
Contributor

@austinbyers austinbyers Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we break each of these up into smaller unit tests? Each assert condition with a comment should be its own self-contained test.

import json
import logging
import sys

import boto3
from botocore.exceptions import ClientError

_SNS_MAX_SIZE = (256*1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still a maximum Lambda invocation size (6MB for asynchronous requests), but I don't think we'll be anywhere close to that limit. Just something to keep in mind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment to the PR body for a future consideration.

@ryandeivert
Copy link
Contributor Author

@jacknagz @austinbyers PTAL at new commit and comments

@coveralls
Copy link

Coverage Status

Coverage increased (+3.5%) to 82.873% when pulling 21cb539 on ryandeivert-drop-sns-publish-for-lambda-invoke into c50856c on master.

Copy link
Contributor

@austinbyers austinbyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

LOGGER.error('An error occurred while decoding message to JSON: %s', err)
continue
# Return the current list of statuses back to the caller
return list(status for status in run(event, region, function_name, config))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this further:

return list(run(event, region, function_name, config))

list will automatically transform the generator into a list by iterating through it; you don't have to manually iterate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry overlooked that bit.

resource "aws_sns_topic_subscription" "input_topic_subscriptions" {
count = "${length(var.input_sns_topics)}"
topic_arn = "${element(var.input_sns_topics, count.index)}"
endpoint = "${aws_lambda_function.streamalert_rule_processor.arn}:production"
protocol = "lambda"
}

// Lambda Permission: Allow SNS to invoke the Rule Processor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

'"metadata": {"source": {"service": "payload_service_01", ' \
'"entity": "payload_entity_01"}, "rule_name": "test_rule_01", ' \
'"type": "payload_type_01", "log": "payload_data_01", ' \
'"outputs": "rule.outputs_01"}}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the line continuation markers (\) (this is discouraged by Google Python style guide); Python has implicit string joining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in py 2 :) you need to wrap in parenths for that to be effective, but I can add that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, you always need the parens (in both py2 and py3). Good catch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@austinbyers fixed in latest commit.. as a param in a function call it looks like it doesn't need to be encapsulated

@ryandeivert ryandeivert force-pushed the ryandeivert-drop-sns-publish-for-lambda-invoke branch from d0de948 to 5c54db5 Compare July 17, 2017 21:17
@airbnb airbnb deleted a comment from coveralls Jul 17, 2017
@coveralls
Copy link

Coverage Status

Coverage increased (+3.5%) to 82.873% when pulling bf8433e on ryandeivert-drop-sns-publish-for-lambda-invoke into c50856c on master.

@airbnb airbnb deleted a comment from coveralls Jul 17, 2017
@ryandeivert ryandeivert merged commit 51ea4c1 into master Jul 18, 2017
@ryandeivert ryandeivert deleted the ryandeivert-drop-sns-publish-for-lambda-invoke branch July 18, 2017 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improvement: Invoke alert processor directly from rule processor
4 participants