
Create alert merger which dispatches alerts from Dynamo to alert processors #642

Merged
7 commits merged into master from austin-merger on Mar 21, 2018

Conversation

@austinbyers (Contributor) commented Mar 19, 2018

to: @ryandeivert
cc: @airbnb/streamalert-maintainers
size: large

Background

This new Lambda function (the Alert Merger) will, as its name implies, eventually be responsible for alert merging. For now, it simply reads alerts from the alerts Dynamo table and dispatches them to the alert processor.

The lifecycle of an alert is now as follows:

  1. The rule processor generates alerts and saves them in batches to Dynamo
  2. The alert merger runs every minute, finding all alerts in Dynamo which have not yet been sent or which need to be retried
  3. The alert merger invokes the alert processor once for each alert, providing the Dynamo record as the payload
  4. The alert processor deletes the Dynamo record once it is sent successfully, or else updates the list of outputs which need to be retried

Note that this PR does not implement alert merging; it just sets up the last piece of infrastructure required to make it happen. A rough sketch of the dispatch flow is included below for illustration.
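
To make steps 2 and 3 above concrete, the dispatch loop boils down to scanning the alerts table and asynchronously invoking the alert processor once per pending record. The table name, function name, and scan logic below are hypothetical placeholders, not the actual implementation:

import json
import boto3

ALERTS_TABLE = 'streamalert_alerts'              # hypothetical table name
ALERT_PROCESSOR = 'streamalert_alert_processor'  # hypothetical function name

def dispatch_pending_alerts():
    """Find alerts that still need to be sent and hand each one to the alert processor."""
    table = boto3.resource('dynamodb').Table(ALERTS_TABLE)
    lambda_client = boto3.client('lambda')

    # The real merger filters for alerts that were never sent or need to be retried
    # and paginates the scan; a bare Scan is shown only to keep the sketch short.
    response = table.scan()
    for record in response.get('Items', []):
        lambda_client.invoke(
            FunctionName=ALERT_PROCESSOR,
            InvocationType='Event',                  # asynchronous: one invocation per alert
            Payload=json.dumps(record, default=str)  # the Dynamo record is the payload
        )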

Changes

  • Rule processor no longer invokes alert processor
  • Add a new Alert Merger Lambda function, which runs every minute
  • Update the alert processor with the new invocation format
  • Support metric filters in the tf_lambda module
  • Add a new AlertAttempts custom metric emitted by the Alert Merger

Testing

  • 100% test coverage for the alert merger's main module
  • Deploy to a test account and verify alerts end-to-end
  • Deploy just the alert merger: ./manage.py lambda deploy -p alert_merger
  • Rollback just the alert merger: ./manage.py lambda rollback -p alert_merger
  • Try an alert which can't be sent and verify it is retried
  • Enable and disable custom metrics

@austinbyers austinbyers added this to the 2.0.0 milestone Mar 19, 2018
@austinbyers austinbyers requested a review from ryandeivert March 19, 2018 20:08
@coveralls commented Mar 19, 2018

Coverage Status

Coverage decreased (-0.2%) to 95.554% when pulling a53994f on austin-merger into e6bd483 on master.

ryandeivert previously approved these changes Mar 19, 2018
@ryandeivert (Contributor) left a comment


😮 this is so good!!!! no real requests from me :) well done!


def handler(event, context):  # pylint: disable=unused-argument
    """Entry point for the alert merger."""
    global MERGER  # pylint: disable=global-statement
ryandeivert (Contributor) commented:

I thought the global keyword was supposed to be avoided whenever possible. We could favor class properties (not instance properties), which would keep the client active for the life of the container. I thought we were doing this elsewhere, but I'm having trouble finding it... basically this:

class Test(object):
    CLS_PROP = None

    def __init__(self):
        if not Test.CLS_PROP:
            Test.CLS_PROP = "now created"

    def get_prop(self):
        return self.CLS_PROP

for i in range(2):
    # when i == 1 the object is already cached
    print 'Class prop before init #%d: %s' % (i+1, Test.CLS_PROP)
    tester = Test()
    print 'Class prop after init #%d: %s' % (i+1, Test.CLS_PROP)
    print 'Instance prop after init #%d: %s' % (i+1, tester.get_prop())

I like this because it still encapsulates the client within the class that actually uses it.

EDIT: Pardon my naivety - I was thinking this was a client wrapper but quickly realized it is not.

austinbyers (Contributor, Author) replied:

No, that's a good point. After some discussion, I decided to go with your suggestion and cache the instantiation in the class itself. It makes the handler itself super clean:

def handler(event, context):
    return AlertProcessor.get_instance(context.invoked_function_arn).run(event)
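
For context, this is roughly what that class-level caching looks like; the attribute names and setup details here are illustrative, not the exact code:

class AlertProcessor(object):
    """Sends an alert to its configured outputs (illustrative sketch)."""
    _INSTANCE = None  # cached for the life of the Lambda container

    def __init__(self, invoked_function_arn):
        # Expensive one-time setup (boto3 clients, output config) happens here
        self.arn = invoked_function_arn

    @classmethod
    def get_instance(cls, invoked_function_arn):
        """Create the processor on the first invocation and reuse it afterwards."""
        if cls._INSTANCE is None:
            cls._INSTANCE = cls(invoked_function_arn)
        return cls._INSTANCE

    def run(self, event):
        """Process a single alert payload and return the send results."""
        # ... send to each configured output ...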

except ClientError as error:
    if error.response['Error']['Code'] == 'ConditionalCheckFailedException':
        LOGGER.warn('Conditional update failed: %s', error.response['Error']['Message'])
    else:
ryandeivert (Contributor) commented:

Can you add some context as to why this particular error is okay and only logs, while others will be raised?

austinbyers (Contributor, Author) replied:

It's possible (though very unlikely) that the alert processor could delete the entry from the table before the alert merger updates it with its additional state. If that happens, that's totally fine - we want the rows to be deleted when we don't need them anymore.

I removed the logging entirely; it shouldn't even be a warning, just an unusual order of operations. I also added some clarifying comments.
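
The pattern in question is a conditional update guarded on the row still existing; the key schema and attribute names below are hypothetical, but they show why ConditionalCheckFailedException is safe to swallow:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('streamalert_alerts')  # hypothetical table name

def record_dispatch(rule_name, alert_id):
    """Track a dispatch attempt, but only if the alert row still exists."""
    try:
        table.update_item(
            Key={'RuleName': rule_name, 'AlertID': alert_id},
            UpdateExpression='ADD Attempts :one',
            # Fails if the alert processor already deleted the row, which is fine:
            # the alert was delivered and there is no state left to update.
            ConditionExpression='attribute_exists(AlertID)',
            ExpressionAttributeValues={':one': 1}
        )
    except ClientError as error:
        if error.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise  # any other DynamoDB error is unexpected and should propagate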

@austinbyers austinbyers dismissed ryandeivert’s stale review March 21, 2018 00:48

rebase + new metric == substantial change

@austinbyers (Contributor, Author) commented Mar 21, 2018

@ryandeivert Thanks for the feedback, I've addressed it! I also rebased with your recent change (#643) and added support for custom metrics in the alert merger. I tested again end-to-end with a new deploy, but I felt these were substantial enough code changes to warrant another review when you get a chance!

@ryandeivert (Contributor) left a comment

thanks for these changes! 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑 🐑

-    if not MERGER:
-        MERGER = AlertMerger()
-    MERGER.dispatch()
+    AlertMerger.get_instance().dispatch()
ryandeivert (Contributor) commented:

this is excellent! :) thanks for dealing with my requests!

@@ -60,6 +64,9 @@ class MetricLogger(object):
    FIREHOSE_FAILED_RECORDS = 'FirehoseFailedRecords'
    NORMALIZED_RECORDS = 'NormalizedRecords'

    # Alert Merger metric names
    ALERT_ATTEMPTS = 'AlertAttempts'
ryandeivert (Contributor) commented:

🎉
