
Create shared AlertTable and Alert (with merge algorithm) #666

Merged
12 commits merged into master from austin-alert-class on Apr 6, 2018

Conversation

@austinbyers (Contributor) commented Mar 30, 2018

to: @ryandeivert @jacknagz
cc: @airbnb/streamalert-maintainers
size: large

Background

This implements the actual algorithm for merging alerts. Merging is not yet enabled; this change only adds the logic to make it possible.

In addition, to make this change and future development easier, I've taken the time to refactor Alert and AlertTable into shared classes used by all of the core Lambda functions.

Changes

  1. Currently, the alert merger sends the raw alert payload to the alert processor. However, there is a relatively low limit on Lambda invocation payload size (128 KB). Since merged alerts may be even bigger than normal, we get ahead of this problem by sending only the alert ID and rule name to the alert processor, which then downloads the full record from Dynamo (see the sketch after this list).

  2. Create AlertTable class in shared/ for the Dynamo alerts table

    • This class handles all queries and updates to the alerts table
    • Rule Processor sends alerts in batches to Dynamo
    • Alert Merger finds all the alerts in Dynamo and sends their IDs to Alert Processors
    • Alert Processor downloads the full alert payload from Dynamo and then deletes it (if successful) or updates it (if some outputs failed)
  3. Create Alert class in shared/, replacing the dict/OrderedDict in use now

    • This class stores all of the alert data, converts the alert between formats (e.g. to and from Dynamo), creates merged alerts, etc
    • The Alert includes a new staged attribute to be used when rule baking is implemented
    • Rule Processor creates the Alert objects
    • Alert Merger will merge multiple Alerts into a new Alert (future change)
    • Alert Processor reads the attributes of the Alert. The Alert instance is now passed to output subclasses instead of a dictionary.
  4. RetryOutputs Improvements

    • If all outputs failed to send, there is no need to update RetryOutputs - the next alert processor will retry the same set of outputs.
    • The RetryOutputs update will now fail if the alert no longer exists. This completely eliminates the possibility of the update operation creating an invalid entry in the table if the alert was already deleted somehow (e.g. manually)
  5. Fixes a bug introduced by #664 ("[cli][tf] lambda tf module generation now expects the lambda config from caller"), where the CLI raised an exception when trying to generate metric filters for functions using the tf_lambda module
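
To make the Dynamo-backed flow concrete, here is a minimal sketch of how the shared table wrapper could be used by the three Lambda functions. It is illustrative only: the method names (add_alerts, get_alert_record, etc.), the dynamo_record() helper, and the RuleName/AlertID key schema are assumptions based on the record attributes quoted later in this review, not the merged implementation.

import boto3


class AlertTable(object):
    """Sketch of the shared Dynamo alerts table wrapper (assumed interface, not the real code)."""

    def __init__(self, table_name):
        # Assumes a table keyed on RuleName (partition key) and AlertID (sort key)
        self.table = boto3.resource('dynamodb').Table(table_name)

    def add_alerts(self, alerts):
        """Rule processor: batch-write newly generated alerts to Dynamo."""
        with self.table.batch_writer() as batch:
            for alert in alerts:
                batch.put_item(Item=alert.dynamo_record())  # dynamo_record() is hypothetical

    def rule_names(self):
        """Alert merger: find which rules currently have alerts waiting (pagination omitted)."""
        response = self.table.scan(ProjectionExpression='RuleName')
        return {item['RuleName'] for item in response['Items']}

    def get_alert_record(self, rule_name, alert_id):
        """Alert processor: download the full payload given only the rule name and alert ID."""
        return self.table.get_item(
            Key={'RuleName': rule_name, 'AlertID': alert_id}).get('Item')

    def delete_alert(self, rule_name, alert_id):
        """Alert processor: remove the alert once every output has succeeded."""
        self.table.delete_item(Key={'RuleName': rule_name, 'AlertID': alert_id})

Because the merger only ever invokes the alert processor with something like {'RuleName': ..., 'AlertID': ...}, the invocation payload stays far below the Lambda limit no matter how large the merged record grows.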

Testing

  • 100% unit test coverage for the Alert and AlertTable classes
  • Deployed to a test account: verified alerts worked via Firehose (and landed in Athena), SNS, and Slack

@austinbyers austinbyers requested a review from jacknagz March 30, 2018 23:31
@austinbyers austinbyers added this to the 2.0.0 milestone Mar 30, 2018

@coveralls commented Mar 30, 2018

Coverage Status

Coverage increased (+0.2%) to 95.889% when pulling 4090b6f on austin-alert-class into 63264bf on master.

@austinbyers austinbyers requested a review from ryandeivert March 31, 2018 00:44

@austinbyers (Contributor Author) commented:

Also, just realized I got PR 666! 😈 😈 😈

@jacknagz (Contributor) left a review:

Awesome work! One quick comment

"""
# Update the alerts table with the dispatch time, but only if the alert still exists.
# (The alert processor could have deleted the alert before this table update finishes).
try:

Contributor:

Could this try/except live in a wrapper function around update_item instead, one that both mark_as_dispatched and update_retry_outputs use?

Contributor Author:

Yeah, good idea. In my original implementation, it was only on one of the functions

Contributor Author (@austinbyers, Apr 6, 2018):

In fact, creating a decorator turned out to be even cleaner!

@ignore_conditional_failure
def update_table(self, ...)
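
For context, a decorator along these lines would swallow only Dynamo's conditional-check failure and let every other error propagate (a sketch of the idea, not necessarily the code that was merged):

from functools import wraps

from botocore.exceptions import ClientError


def ignore_conditional_failure(func):
    """Ignore ConditionalCheckFailedException, e.g. if the alert was already deleted."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ClientError as error:
            # Re-raise anything other than a failed ConditionExpression
            if error.response['Error']['Code'] != 'ConditionalCheckFailedException':
                raise
    return wrapper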

@ryandeivert (Contributor) left a review:

A few minor comments and a request for clarification, but otherwise this looks awesome!! :)

@@ -367,7 +367,7 @@ max-bool-expr=5
max-branches=25

# Maximum number of locals for function / method body
max-locals=25
max-locals=30

Contributor:

where do we draw the line here? 🤣

Contributor Author:

True. Pylint was complaining that the Alert class had too many attributes. I felt it made sense as is, but if it grew considerably it should be refactored, which is why I raised this threshold instead of disabling the check on the class, for example. Do you feel we should restructure the class?
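
For reference, the alternative to bumping the global threshold would be a targeted suppression on the class itself, presumably for Pylint's too-many-instance-attributes check (illustrative snippet, not the actual class definition):

class Alert(object):  # pylint: disable=too-many-instance-attributes
    """Encapsulates a single alert and its various representations."""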

@@ -68,14 +68,16 @@ def _tf_vpc_config(lambda_config):
return result


def generate_lambda(function_name, lambda_config, config, environment=None):
def generate_lambda(function_name, lambda_config, config, environment=None, metrics_lookup=None):

Contributor:

sorry for this - thanks for fixing!!!

elif any(output_results.values()):
    # At least one output succeeded - update the table with those outputs that need to be retried
    self.alerts_table.update_retry_outputs(alert)
# If all outputs failed, no table updates are necessary

Contributor:

this hanging comment is a bit odd - what line is it related to exactly? maybe it's the else case..?

Contributor Author:

Yeah, it's the else case, but having else: pass seemed unnecessary. If I start the comment with # else: ... would that be clearer?

if dispatcher:
    results[output] = self._send_alert(payload, output, dispatcher)
# Remove normalization key from the record
if NORMALIZATION_KEY in alert.record:

Contributor:

why is removal of this key required? don't we want this information on the receiving end? I don't believe we've done this in the past

Contributor Author:

Yep, we are currently removing the data normalization key when building the alert payload here:

@staticmethod
def _build_alert_payload(record):
    """Transform a raw Dynamo record into a payload ready for dispatching.

    Args:
        record (dict): A row in the Dynamo alerts table

    Returns:
        OrderedDict: An alert payload ready to be sent to output dispatchers.
    """
    # Any problems with the alert keys or JSON loading will raise an exception here.
    # This is what we want - an invalid alert is a show-stopper and shouldn't ever happen.
    return ordered_dict({
        'cluster': record['Cluster'],
        'created': record['Created'],
        'id': record['AlertID'],
        'log_source': record['LogSource'],
        'log_type': record['LogType'],
        'outputs': record['Outputs'],
        'record': json.loads(record['Record']),
        'rule_description': record['RuleDescription'],
        'rule_name': record['RuleName'],
        'source_entity': record['SourceEntity'],
        'source_service': record['SourceService']
    }, exclude_keys={NORMALIZATION_KEY})

We could discuss adding it back in, but for now I want to keep the same behavior

Contributor:

Sure, I see that we're doing that now - but before the changes you link to, were we doing that? I'm unsure why we would want to remove these.

Contributor Author:

Removing normalized keys from outputs has been in place since September: 201aa1e#diff-292514b478ade4d1cb4f29a7b9c32528

Perhaps @chunyong-lin can provide more context, but my guess is that it doesn't provide much useful information when triaging the alert, since it is essentially a copy of many of the fields in the alert. Saying source_hash_md5 = 'ABCD' in the alert AND streamalert:normalizedtypes = { 'md5': 'ABCD'} makes the PagerDuty alert unnecessarily verbose, for example.

Note that threat intel IOCs are retained.

But I 100% agree that it would be useful to have this information at least in Athena, so we could query for all alerts which contain a given type. Thus far, we have sent exactly the same alert payload to every output. With alert merging, this will be changing, and we should definitely revisit this soon

Contributor Author:

I've added a TODO to make sure we don't forget about it - great point

Contributor:

sure - that makes sense :) thanks for the context!

@@ -63,7 +63,7 @@ def format_output_config(cls, service_config, values):
**{values['descriptor'].value: values['aws_value'].value})

@abstractmethod
def dispatch(self, **kwargs):
def dispatch(self, alert, descriptor):

Contributor:

thank you for this change!! this is how I initially wanted it :)

Contributor Author:

Yay!
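
With the new signature, a concrete output receives the Alert object and the output descriptor directly instead of unpacking kwargs. A purely hypothetical example (both ExampleOutput and the stand-in base class below are illustrative, not actual StreamAlert code):

class OutputDispatcher(object):
    """Stand-in for the real output base class, just enough for this sketch."""

    def dispatch(self, alert, descriptor):
        raise NotImplementedError


class ExampleOutput(OutputDispatcher):
    """Hypothetical output implementing dispatch(alert, descriptor)."""

    def dispatch(self, alert, descriptor):
        # The Alert instance is passed in directly instead of a dict of kwargs
        print('Sending alert {} from rule {} to descriptor {}'.format(
            alert.alert_id, alert.rule_name, descriptor))
        return True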

)
)

self.alert_id = kwargs.get('alert_id') or str(uuid.uuid4())

Contributor:

instead of doing the kwargs.get(...) or ... logic, why not just supply a default value to kwargs.get()? for example: kwargs.get('alert_id', str(uuid.uuid4())), since kwargs is just a dictionary. is there some benefit I'm missing?

Contributor Author:

I address this in a comment a few lines down, but the basic problem is that you could have Falsey values of the wrong type.

In particular, Dynamo will reject empty strings and empty sets. So if someone were to call Alert(alert_id=''), a simple kwargs.get('alert_id') will accept the empty string value, which we don't want. The idea of using or is to force any Falsey value (0, None, '', [], {}) to the Falsey value of the appropriate type.

Happy to discuss whether this is necessary in every case (alert_id in particular should never be an empty string, but cluster or log_source could be), but it seemed a bit safer to enforce the correct types from the beginning of Alert creation
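
A quick standalone illustration of the difference (not project code):

import uuid

kwargs = {'alert_id': ''}  # falsey value that Dynamo would reject

# The default only applies when the key is missing, so the empty string survives
with_default = kwargs.get('alert_id', str(uuid.uuid4()))
assert with_default == ''

# `or` also replaces falsey values such as '', None, [], {} with the fallback
with_or = kwargs.get('alert_id') or str(uuid.uuid4())
assert with_or != ''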

Contributor:

aha! this is actually the reasoning I had behind this code awhile back, but I removed that because I couldn't recall my reasoning lol. the gist there was that I wanted to support having a value in the config (so the user knew the config setting existed), but if said value was blank we would fall back on a default. a typical .get(...) was not enough for this.

Thanks for the explanation :) I'm okay with leaving as is (and sorry I didn't see your comment :P)

Contributor Author:

I expanded on the comment and moved it further up to make it easier to see

if not set(kwargs).issubset(self._EXPECTED_INIT_KWARGS):
    raise TypeError(
        'Invalid Alert kwargs: {} is not a subset of {}'.format(
            ','.join(sorted(kwargs)), ','.join(sorted(self._EXPECTED_INIT_KWARGS))

Contributor:

Since this is going to result in a very large Exception message due to there being a large list of _EXPECTED_INIT_KWARGS - I'd suggest using a difference of the sets to make this more readable/actionable. something like:

raise TypeError(
    'Invalid Alert kwargs: {} argument(s) are not in the expected set of {}'.format(
        ', '.join(sorted(set(kwargs).difference(self._EXPECTED_INIT_KWARGS))),
        ', '.join(sorted(self._EXPECTED_INIT_KWARGS))
    )
)

also putting a space after the comma in the join will help with readability

Contributor Author:

Good idea, thanks! Will do

    Returns:
        Alert: An alert with all properties populated from the Dynamo item
    """
    return cls(

Contributor:

++++++++++

'id': self.alert_id,
'log_source': self.log_source or '',
'log_type': self.log_type or '',
'outputs': list(sorted(self.outputs)), # List instead of set for JSON-compatibility

Contributor:

good use of comments ++

}

# Union all of the outputs together.
all_outputs = set()

Contributor:

hmm I'm not sure if this was covered in the design, but I don't know how I feel about this part. maybe you can offer some clarification, but it appears that similar alerts that have an output of say "whatever-bucket" can be merged with other alerts that do not have the output of "whatever-bucket" and instead have an output of "foo-firehose". is this correct? If so, I don't think this should happen. I believe the merge should only occur for alerts going to the same exact outputs. maybe I'm missing it or this logic exists elsewhere, but this sorta goes against the idea of segmentation.

for instance, if we configure a specific rule that alerts on sensitive info to send to one output (that is specific to this rule's use case), I'm thinking this logic could result in those alerts landing in other outputs as well. let me know if I'm not seeing this correctly...

EDIT: after speaking with jack and continuing my review I think maybe this merging only happens per-rule and not across rules (please confirm). in which case, when would the outputs differ and need to be union-ed here?

Contributor Author:

Correct - merging will only happen on the same rule name.

The possibility of different outputs comes from large merge windows. A rule may have a merge window of 24-72 hours, and the outputs listed in that rule (or any parameters, really) could change in that timeframe. The most conservative approach here is to send to every output which was defined at any point during the merge window, but alternatively we could just use the set of outputs in the most recent version of the rule
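
Put another way, the merge currently takes the union of the outputs recorded on every alert in the window, roughly like this (output names are hypothetical):

# Outputs stored on the individual alerts being merged
alert_outputs = [
    {'aws-firehose:alerts', 'slack:security'},
    {'aws-firehose:alerts'},  # the rule's outputs changed mid-window
]

# Most conservative option: deliver to every output configured at any point in the window
all_outputs = set.union(*alert_outputs)

# Alternative discussed below: only use the most recent rule definition's outputs
latest_outputs = alert_outputs[-1]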

Contributor:

I like the idea of using the latest set of outputs, but could go either way on this. let's leave it as is and maybe discuss if it becomes a problem for some reason

Contributor Author (@austinbyers, Apr 6, 2018):

I agree latest might be more clear. Otherwise, someone might remove an output and wonder why, 12 hours later, a merged alert raised a bunch of errors for trying to deliver to an output which doesn't exist.

@austinbyers (Contributor Author) commented Apr 6, 2018

PTAL @jacknagz @ryandeivert. I've addressed your feedback and confirmed it still works end to end

@ryandeivert (Contributor) left a review:

changes LGTM! thanks 🎉
