
[rule processor] Support data normalization #285

Merged

merged 9 commits into master from new_data_normalization on Sep 7, 2017

Conversation

@chunyong-lin (Contributor) commented Aug 31, 2017

to @ryandeivert @mime-frame
cc @austinbyers
size: medium
contributes to: #105

Background

Currently, logs from different sources (CarbonBlack, osquery, etc.) can capture the same type of information but use different schemas or key names. As a result, you must write a separate rule for every log source, referencing the specific key names that source uses.

Example:

  • Log1 from LogSource1: {"host": "bob-pc", "cmdline": "wget 'evil.com'"}
  • Log2 from LogSource2: {"src_host": "bob-pc", "command": "wget 'hacked.com'"}

Given the example above, alerting on any suspicious use of wget requires writing a rule for each log. Example rules look like this:

Rule 1 for Log1 (pseudocode):
if rec['cmdline'] matches "wget *":
	return true
return false
Rule 2 for Log2 (pseudocode):
if rec['command'] matches "wget *":
	return true
return false

The data normalization feature lets you write one rule against all relevant logs. With data normalization applied, a single rule can alert on wget usage across both logs shown above.

"""This is sample rule to alert on any suspicious use of wget"""
import fnmatch
from helpers.base import fetch_values_by_datatype

@rule(datatypes=['command'])
def alert_suspecious_wget(rec):
    results = fetch_values_by_datatype(rec, 'command')
    for result in results:
        if fnmatch(result, "wget *"):
            return true
    return false
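
For context, here is a minimal sketch of what a helper like fetch_values_by_datatype could look like (an illustration only, not the exact helpers.base implementation; it assumes the rule processor has already attached a normalized_types mapping of type name to key paths, and the 'decorations' key below is a hypothetical example):

def fetch_values_by_datatype(rec, datatype):
    """Fetch all values in a record that carry the given normalized type.

    Assumes rec['normalized_types'] maps each normalized type to a list
    of key paths, e.g. {'command': [['cmdline'], ['decorations', 'command']]}.
    """
    results = []
    if datatype not in rec.get('normalized_types', {}):
        return results

    for key_path in rec['normalized_types'][datatype]:
        value = rec
        for key in key_path:  # walk nested keys down to the stored value
            value = value[key]
        results.append(value)
    return results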

We use the CEF (Common Event Format) standard to define normalized types. Please refer to the CEF documentation (PDF) if you want to define your own custom normalized types.

Changes

  • New configuration file types.json
    This configuration file defines the normalized types for each log source. The example below shows the normalized types defined for logs from cloudwatch.
{
  "cloudwatch":{
    "destinationAccount": ["recipientAccountId"],
    "destinationAddress": ["destination"],
    "destinationPort": ["destport"],
    "eventName": ["eventName"],
    "eventType": ["eventType"],
    "region": ["region"],
    "sourceAccount": ["account"],
    "sourceAddress": ["source", "sourceIPAddress"],
    "sourcePort": ["srcport"],
    "transportProtocol": ["protocol"],
    "userAgent": ["userAgent"],
    "userName": ["userName", "owner", "invokedBy"]
  }
}
  • Data normalization logic is applied in the rule processor; a rough sketch of the idea follows below.
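
As a rough illustration of that logic (not the PR's exact match_types implementation), the rule processor can use the mapping above to record which keys in an incoming record carry each normalized type:

# Hypothetical sketch: tag a record with its normalized types using the
# "cloudwatch" mapping from conf/types.json shown above (subset for brevity)
TYPES = {
    'sourceAddress': ['source', 'sourceIPAddress'],
    'userName': ['userName', 'owner', 'invokedBy'],
}

def normalize(record, type_mapping):
    """Map each normalized type to the key paths in the record that carry it."""
    normalized = {}
    for datatype, candidate_keys in type_mapping.items():
        matched = [[key] for key in candidate_keys if key in record]
        if matched:
            normalized[datatype] = matched
    return normalized

rec = {'sourceIPAddress': '1.1.1.2', 'userName': 'alice', 'eventName': 'PutObject'}
print(normalize(rec, TYPES))
# {'sourceAddress': [['sourceIPAddress']], 'userName': [['userName']]}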

Testing

  • Added sample rules using the data normalization feature, plus an integration test.
    • Integration tests passed.
  • Added unit tests.
    • Unit tests passed.
  • Also tested in a testing AWS account:
    • Normalized type sourceAddress worked correctly
    • Normalized type eventType worked correctly
    • Normalized type userName worked correctly

Upcoming

  • Type validation (validate the values of normalized types: ipv4, domain, cmd, etc.).
  • Detailed documentation on data normalization.

@coveralls

Coverage Status

Coverage decreased (-0.003%) to 95.211% when pulling 7093697 on new_data_normalization into ec648d1 on master.

@chunyong-lin force-pushed the new_data_normalization branch from fde052e to 70fc2e5 on August 31, 2017 at 21:43
@coveralls

Coverage Status

Coverage decreased (-0.003%) to 95.207% when pulling 70fc2e5 on new_data_normalization into abdbd24 on master.

@austinbyers (Contributor) left a comment

This is going to be huge!

helpers/base.py Outdated
(list) The values of normalized types
"""
results = []
if not datatype in rec['normalized_types'].keys():
Contributor: Nit: the .keys() is superfluous and you can remove it

@@ -0,0 +1,26 @@
"""Alert on matching IP address from aws access."""
from stream_alert.rule_processor.rules_engine import StreamRules
from helpers.base import fetch_values_by_datatype
Contributor: Nit: To be alphabetical, the helpers.base import should come first

for result in results:
if result == '1.1.1.2':
return True
return False
Contributor: 🎉 This is a great example that helped me understand how the normalization can be used

outputs=['aws-s3:sample-bucket',
'pagerduty:sample-integration',
'slack:sample-channel'])
def cloudtrail_aws_access_by_evil(rec):
We no longer put "sample" rules in the public repository, only real ones. I'll work with you offline to choose which ones to open source :)

@@ -88,6 +88,14 @@ def _validate_config(config):
raise ConfigError(
'List of \'logs\' is empty for entity: {}'.format(entity))

# validate supported normalized types
supported_logs = [
'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
a hardcoded list? Doesn't seem right... why not pull from the config?

Contributor (Author): This code block validates whether the log sources defined in types.json are supported. I will remove this code since it looks redundant; we assume the log sources defined in types.json are supported.

if not (datatypes and cls.validate_datatypes(normalized_types, datatypes)):
return results

for key, val in record.iteritems(): # pylint: disable=too-many-nested-blocks
hah, that pylint warning exists for a reason :)

Here's how to avoid having too much nesting: https://blog.rburchell.com/2010/11/coding-antipatterns-excessive-nesting.html

Contributor: As a suggestion: to handle nested types better, instead of only accounting for two levels deep, you could call match_types from within this loop when val is a dict. You'd have to instantiate the results dict outside of this function and pass it in by reference to be updated, but it could easily be done. We can talk more about recursion offline if you'd like, but this is a perfect use case for it.
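
A minimal sketch of that recursion idea (illustrative only: the function and argument names here are assumptions, and the PR's final implementation differs in its details):

def match_types(record, normalized_types, datatypes, results, parents=()):
    """Recursively collect the key path of every key matching a wanted datatype.

    `results` is instantiated by the caller and updated in place.
    """
    for key, val in record.items():
        if isinstance(val, dict):
            # Recurse into nested records, remembering the parent key path
            match_types(val, normalized_types, datatypes, results, parents + (key,))
            continue
        for datatype in datatypes:
            if key in normalized_types.get(datatype, []):
                results.setdefault(datatype, []).append(list(parents) + [key])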


@classmethod
def validate_datatypes(cls, normalized_types, datatypes):
"""validate if datatypes valid in normalized_types for certain log
grammar, not sure what this is trying to say

helpers/base.py Outdated
return results

for key in rec['normalized_types'][datatype]:
# Normalized type may be in nested subkeys, we only support one level of
To confirm, this will work for cases where envelope_keys is used, correct?

Contributor (Author): Based on my live test (which has envelope_keys defined), it works for cases where envelope_keys is used.

@ryandeivert (Contributor) left a comment

Hey chunyong -- sorry for the wait but I added a few additional comments!

types_result = cls.match_types(record,
payload.normalized_types,
rule.datatypes)
record.update({'normalized_types': types_result})
Contributor: You don't have to use .update here. update is useful if you have a preexisting dictionary that you want to 'append' to another dict, but since you're only adding one key/value item to the dictionary, just use:

record['normalized_types'] = types_result

(boolean): return true if all datatypes are defined
"""
if not normalized_types:
LOGGER.error('Normalized_types is empty.')
Contributor: Please use a human readable string in the logger statement here ('Normalized types dictionary is empty')

return False

for datatype in datatypes:
if not datatype in normalized_types.keys():
Contributor: The .keys() call here is unnecessary. Please utilize if not datatype in normalized_types

supported_logs = [
'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
]
for log_type in config['types'].keys():
Contributor: The .keys() call here is not required

@chunyong-lin force-pushed the new_data_normalization branch 2 times, most recently from aefe50d to dc3c0ee on September 5, 2017 at 22:58
@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling dc3c0ee on new_data_normalization into d988eaf on master.

@chunyong-lin (Contributor, Author)

@ryandeivert @mime-frame @austinbyers PTAL.

ghost previously requested changes Sep 6, 2017
conf/types.json Outdated
"sourceAddress": ["sourceIPAddress"]
},
"ghe": {
"processName": ["program"],
In GHE's case, program is not a real process, please remove this

conf/types.json Outdated
"ghe": {
"processName": ["program"],
"userName": ["current_user"],
"destinationAddress": ["remote_address"],
I don't see an example GHE log with remote_address - let's chat offline

Contributor (Author): It is defined in the GHE schema.

conf/types.json Outdated
"transportProtocol": ["protocol"],
"severity": ["severity"],
"environmentIdentifier": ["envIdentifier"],
"roleIdentifier": ["roleIdentifier"],
this is a custom decoration that only we do - remove

conf/types.json Outdated
"filePath": ["path"],
"transportProtocol": ["protocol"],
"severity": ["severity"],
"environmentIdentifier": ["envIdentifier"],
this is a custom decoration that only we do - remove

conf/types.json Outdated
"command": ["cmdline", "command"],
"message": ["message"],
"sourceAddress": ["host", "source", "local_address", "address"],
"destinationAddress": ["destination", "remote_address", "gateway"],
make sure to account for values like ::1 when you write the validation classes

conf/types.json Outdated
"destinationAddress": ["remote_address"],
"sourcePort": ["port"]
},
"osquery": {
add "fileHash": ["md5", "sha1", "sha256"] (see https://osquery.io/docs/tables/#hash)

add sourceUserId, have the array contain uid (see https://osquery.io/docs/tables/#users)

add receiptTime, have the array contain unixTime (all osquery logs have this field, it denotes when the info was collected)

add fileSize, have it contain size. ex: https://osquery.io/docs/tables/#file_events

conf/types.json Outdated
},
"osquery": {
"userName": ["username", "user"],
"filePath": ["path"],
add directory to the filePath array (see https://osquery.io/docs/tables/#hash)
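
Taken together, the requested additions would make the osquery entry in conf/types.json look roughly like this (assembled from the comments above, not necessarily the merged file):

{
  "osquery": {
    "userName": ["username", "user"],
    "filePath": ["path", "directory"],
    "fileHash": ["md5", "sha1", "sha256"],
    "fileSize": ["size"],
    "sourceUserId": ["uid"],
    "receiptTime": ["unixTime"]
  }
}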

@ghost dismissed their stale review September 6, 2017 18:34

done with my review

@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling e957042 on new_data_normalization into d988eaf on master.

Chunyong Lin added 8 commits September 6, 2017 16:49
* Add a configuration file conf/types.json
* Add data normalization logic in rule processor
* Add a sample rule for integration test
* Add unit test cases
We defined 24 normalized types:
    "account"
    "agent"
    "cluster"
    "cmd"
    "domain"
    "event_name"
    "event_type"
    "hashmd5"
    "host"
    "ipv4"
    "msg"
    "name"
    "os"
    "path"
    "port"
    "process"
    "protocol"
    "region"
    "role"
    "score"
    "sev"
    "user_type"
    "username"
    "vend"
Refactor two methods to be able to traverse all nested keys.
* `fetch_values_by_datatype`
* `match_types`
* Use CEF standard to define normalized types.
* The bug was introduced when the return of method match_types mixed strings and lists, e.g. ['key1', ['key2', 'subkey2']].
* The solution is to enforce the return to be [list1, list2, list3, ...]
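
For illustration, the shape change described in that commit looks like this (with hypothetical keys):

# Before the fix: match_types could return a mix of bare keys and key paths
buggy = ['key1', ['key2', 'subkey2']]

# After the fix: every entry is a key-path list, even for top-level keys
fixed = [['key1'], ['key2', 'subkey2']]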
@chunyong-lin force-pushed the new_data_normalization branch from d6329b9 to f97a6a7 on September 6, 2017 at 23:52
@coveralls

Coverage Status

Changes unknown when pulling d2f2ebf on new_data_normalization into master.

@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling d6329b9 on new_data_normalization into 8942c7d on master.

@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling f97a6a7 on new_data_normalization into a7f5d4f on master.

@ryandeivert (Contributor) left a comment

Hi @chunyong-lin, a few small comments I would like you to address before merging, but I am stamping!!

Congrats on your first big contribution - this is going to be HUGE!

return results

@classmethod
def update(cls, results, parent_key, nested_results):
Contributor: Can you add a description of the args to this function's docstring?


def test_fetch_values_by_datatype():
"""Helpers - Fetch values from a record by normalized type"""
rec = {
Contributor: Can we remove the unicode (u'..) prefix from these strings, or are they required for your tests? They typically aren't necessary unless you explicitly need the strings as unicode. Please also remove from your test on line 125 below if you can.

results = dict()
for key, val in record.iteritems():
if isinstance(val, dict):
nested_results = cls.match_types_helper(val, normalized_types, datatypes)
Contributor: Well done! 😸

in nested_results are original keys of normalized types.
nested_results (dict): A dict of normalized_types from nested record

Returns:
Contributor: @chunyong-lin this function doesn't actually 'return' anything. please remove this since it alters a dictionary by reference

Contributor (Author): Good catch! Thanks!

@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling 78a313b on new_data_normalization into a7f5d4f on master.

@chunyong-lin force-pushed the new_data_normalization branch from 78a313b to e6add10 on September 7, 2017 at 18:23
@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling e6add10 on new_data_normalization into a7f5d4f on master.

@chunyong-lin merged commit 919f94d into master Sep 7, 2017
@chunyong-lin deleted the new_data_normalization branch September 7, 2017 18:50
@ryandeivert added this to the 1.5.0 milestone Sep 12, 2017