
[rule processor] Support data normalization #285

Merged

merged 9 commits into master from new_data_normalization on Sep 7, 2017

Conversation

@chunyong-lin (Contributor) commented Aug 31, 2017

to @ryandeivert @mime-frame
cc @austinbyers
size: medium
contributes to: #105

Background

Currently, logs from different sources (CarbonBlack, osquery, etc.) can capture the same type of information but use different schemas or key names. As a result, you must write a separate rule for every log source, referencing the specific key names that source uses.

Example:

  • Log1 from LogSource1: {"host": "bob-pc", "cmdline": "wget 'evil.com'"}
  • Log2 from LogSource2: {"src_host": "bob-pc", "command": "wget 'hacked.com'"}

Given the example above, alerting on any suspicious use of wget requires writing a rule for each log. Example rules look like this:

Rule 1 for Log1 (pseudocode):
if rec['cmdline'] matches "wget *":
	return true
return false
Rule 2 for Log2 (pseudocode):
if rec['command'] matches "wget *":
	return true
return false

The data normalization feature lets you write one rule against all relevant logs. With data normalization applied, a single rule can alert on wget usage across both logs shown above.

"""This is sample rule to alert on any suspicious use of wget"""
import fnmatch
from helpers.base import fetch_values_by_datatype

@rule(datatypes=['command'])
def alert_suspecious_wget(rec):
    results = fetch_values_by_datatype(rec, 'command')
    for result in results:
        if fnmatch(result, "wget *"):
            return true
    return false
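
For context, here is a minimal sketch of what a helper like fetch_values_by_datatype could look like (an illustration only, not the exact helpers.base implementation; it assumes the rule processor has already attached a normalized_types mapping of type name to key paths, and the 'decorations' key below is a hypothetical example):

def fetch_values_by_datatype(rec, datatype):
    """Fetch all values in a record that carry the given normalized type.

    Assumes rec['normalized_types'] maps each normalized type to a list
    of key paths, e.g. {'command': [['cmdline'], ['decorations', 'command']]}.
    """
    results = []
    if datatype not in rec.get('normalized_types', {}):
        return results

    for key_path in rec['normalized_types'][datatype]:
        value = rec
        for key in key_path:  # walk nested keys down to the stored value
            value = value[key]
        results.append(value)
    return results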

We use the CEF (Common Event Format) standard to define normalized types. Please refer to the CEF documentation (PDF) if you want to define your own custom normalized types.

Changes

  • New configuration file types.json
    This configuration file defines the normalized types for each log source. The example below shows the normalized types defined for logs from cloudwatch.
{
  "cloudwatch":{
    "destinationAccount": ["recipientAccountId"],
    "destinationAddress": ["destination"],
    "destinationPort": ["destport"],
    "eventName": ["eventName"],
    "eventType": ["eventType"],
    "region": ["region"],
    "sourceAccount": ["account"],
    "sourceAddress": ["source", "sourceIPAddress"],
    "sourcePort": ["srcport"],
    "transportProtocol": ["protocol"],
    "userAgent": ["userAgent"],
    "userName": ["userName", "owner", "invokedBy"]
  }
}
  • Data normalization logic is applied in the rule processor; a rough sketch of the idea follows below.
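
As a rough illustration of that logic (not the PR's exact match_types implementation), the rule processor can use the mapping above to record which keys in an incoming record carry each normalized type:

# Hypothetical sketch: tag a record with its normalized types using the
# "cloudwatch" mapping from conf/types.json shown above (subset for brevity)
TYPES = {
    'sourceAddress': ['source', 'sourceIPAddress'],
    'userName': ['userName', 'owner', 'invokedBy'],
}

def normalize(record, type_mapping):
    """Map each normalized type to the key paths in the record that carry it."""
    normalized = {}
    for datatype, candidate_keys in type_mapping.items():
        matched = [[key] for key in candidate_keys if key in record]
        if matched:
            normalized[datatype] = matched
    return normalized

rec = {'sourceIPAddress': '1.1.1.2', 'userName': 'alice', 'eventName': 'PutObject'}
print(normalize(rec, TYPES))
# {'sourceAddress': [['sourceIPAddress']], 'userName': [['userName']]}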

Testing

  • Added sample rules using the data normalization feature, plus an integration test.
    • Integration tests passed.
  • Added unit tests.
    • Unit tests passed.
  • Also tested in a testing AWS account:
    • Normalized type sourceAddress worked correctly
    • Normalized type eventType worked correctly
    • Normalized type userName worked correctly

Upcoming

  • Type validation (validate the values of normalized types: ipv4, domain, cmd, etc.).
  • Detailed documentation on data normalization.

@coveralls

Coverage Status

Coverage decreased (-0.003%) to 95.211% when pulling 7093697 on new_data_normalization into ec648d1 on master.

@chunyong-lin force-pushed the new_data_normalization branch from fde052e to 70fc2e5 on August 31, 2017 at 21:43
@coveralls

Coverage Status

Coverage decreased (-0.003%) to 95.207% when pulling 70fc2e5 on new_data_normalization into abdbd24 on master.

@austinbyers (Contributor) left a comment

This is going to be huge!

helpers/base.py Outdated
(list) The values of normalized types
"""
results = []
if not datatype in rec['normalized_types'].keys():
Contributor: Nit: the .keys() is superfluous and you can remove it

@@ -0,0 +1,26 @@
"""Alert on matching IP address from aws access."""
from stream_alert.rule_processor.rules_engine import StreamRules
from helpers.base import fetch_values_by_datatype
Contributor: Nit: To be alphabetical, the helpers.base import should come first

for result in results:
if result == '1.1.1.2':
return True
return False
Contributor: 🎉 This is a great example that helped me understand how the normalization can be used

outputs=['aws-s3:sample-bucket',
'pagerduty:sample-integration',
'slack:sample-channel'])
def cloudtrail_aws_access_by_evil(rec):
We no longer put "sample" rules in the public repository, only real ones. I'll work with you offline to choose which ones to open source :)

@@ -88,6 +88,14 @@ def _validate_config(config):
raise ConfigError(
'List of \'logs\' is empty for entity: {}'.format(entity))

# validate supported normalized types
supported_logs = [
'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
a hardcoded list? Doesn't seem right... why not pull from the config?

Contributor (Author): This code block validates whether the log sources defined in types.json are supported. I will remove this code since it looks redundant; we assume the log sources defined in types.json are supported.

if not (datatypes and cls.validate_datatypes(normalized_types, datatypes)):
return results

for key, val in record.iteritems(): # pylint: disable=too-many-nested-blocks
hah, that pylint warning exists for a reason :)

Here's how to avoid having too much nesting: https://blog.rburchell.com/2010/11/coding-antipatterns-excessive-nesting.html

Contributor: As a suggestion: to handle nested types better, instead of only accounting for two levels deep, you could call match_types from within this loop when val is a dict. You'd have to instantiate the results dict outside of this function and pass it in by reference to be updated, but it could easily be done. We can talk more about recursion offline if you'd like, but this is a perfect use case for it.
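
A minimal sketch of that recursion idea (illustrative only: the function and argument names here are assumptions, and the PR's final implementation differs in its details):

def match_types(record, normalized_types, datatypes, results, parents=()):
    """Recursively collect the key path of every key matching a wanted datatype.

    `results` is instantiated by the caller and updated in place.
    """
    for key, val in record.items():
        if isinstance(val, dict):
            # Recurse into nested records, remembering the parent key path
            match_types(val, normalized_types, datatypes, results, parents + (key,))
            continue
        for datatype in datatypes:
            if key in normalized_types.get(datatype, []):
                results.setdefault(datatype, []).append(list(parents) + [key])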


@classmethod
def validate_datatypes(cls, normalized_types, datatypes):
"""validate if datatypes valid in normalized_types for certain log
grammar, not sure what this is trying to say

helpers/base.py Outdated
return results

for key in rec['normalized_types'][datatype]:
# Normalized type may be in nested subkeys, we only support one level of
To confirm, this will work for cases where envelope_keys is used, correct?

Contributor (Author): Based on my live test (which has envelope_keys defined), it works for cases where envelope_keys is used.

@ryandeivert (Contributor) left a comment

Hey chunyong -- sorry for the wait but I added a few additional comments!

types_result = cls.match_types(record,
payload.normalized_types,
rule.datatypes)
record.update({'normalized_types': types_result})
Contributor: You don't have to use .update here. update is useful if you have a preexisting dictionary that you want to 'append' to another dict, but since you're only adding one key/value item to the dictionary, just use:

record['normalized_types'] = types_result

(boolean): return true if all datatypes are defined
"""
if not normalized_types:
LOGGER.error('Normalized_types is empty.')
Contributor: Please use a human readable string in the logger statement here ('Normalized types dictionary is empty')

return False

for datatype in datatypes:
if not datatype in normalized_types.keys():
Contributor: The .keys() call here is unnecessary. Please utilize if not datatype in normalized_types

supported_logs = [
'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
]
for log_type in config['types'].keys():
Contributor: The .keys() call here is not required

@chunyong-lin force-pushed the new_data_normalization branch 2 times, most recently from aefe50d to dc3c0ee on September 5, 2017 at 22:58
@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling dc3c0ee on new_data_normalization into d988eaf on master.

@chunyong-lin (Contributor, Author)

@ryandeivert @mime-frame @austinbyers PTAL.

ghost previously requested changes Sep 6, 2017
conf/types.json Outdated
"sourceAddress": ["sourceIPAddress"]
},
"ghe": {
"processName": ["program"],
In GHE's case, program is not a real process, please remove this

conf/types.json Outdated
"ghe": {
"processName": ["program"],
"userName": ["current_user"],
"destinationAddress": ["remote_address"],
I don't see an example GHE log with remote_address - let's chat offline

Contributor (Author): It is defined in the GHE schema.

conf/types.json Outdated
"transportProtocol": ["protocol"],
"severity": ["severity"],
"environmentIdentifier": ["envIdentifier"],
"roleIdentifier": ["roleIdentifier"],
this is a custom decoration that only we do - remove

conf/types.json Outdated
"filePath": ["path"],
"transportProtocol": ["protocol"],
"severity": ["severity"],
"environmentIdentifier": ["envIdentifier"],
this is a custom decoration that only we do - remove

conf/types.json Outdated
"command": ["cmdline", "command"],
"message": ["message"],
"sourceAddress": ["host", "source", "local_address", "address"],
"destinationAddress": ["destination", "remote_address", "gateway"],
make sure to account for values like ::1 when you write the validation classes

conf/types.json Outdated
"destinationAddress": ["remote_address"],
"sourcePort": ["port"]
},
"osquery": {
add "fileHash": ["md5", "sha1", "sha256"] (see https://osquery.io/docs/tables/#hash)

add sourceUserId, have the array contain uid (see https://osquery.io/docs/tables/#users)

add receiptTime, have the array contain unixTime (all osquery logs have this field, it denotes when the info was collected)

add fileSize, have it contain size. ex: https://osquery.io/docs/tables/#file_events

conf/types.json Outdated
},
"osquery": {
"userName": ["username", "user"],
"filePath": ["path"],
add directory to the filePath array (see https://osquery.io/docs/tables/#hash)
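
Taken together, the requested additions would make the osquery entry in conf/types.json look roughly like this (assembled from the comments above, not necessarily the merged file):

{
  "osquery": {
    "userName": ["username", "user"],
    "filePath": ["path", "directory"],
    "fileHash": ["md5", "sha1", "sha256"],
    "fileSize": ["size"],
    "sourceUserId": ["uid"],
    "receiptTime": ["unixTime"]
  }
}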

@ghost dismissed their stale review September 6, 2017 18:34

done with my review

@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling e957042 on new_data_normalization into d988eaf on master.

Chunyong Lin added 8 commits September 6, 2017 16:49
* Add a configuration file conf/types.json
* Add data normalization logic in rule processor
* Add a sample rule for integration test
* Add unit test cases
We defined 24 normalized types:
    "account"
    "agent"
    "cluster"
    "cmd"
    "domain"
    "event_name"
    "event_type"
    "hashmd5"
    "host"
    "ipv4"
    "msg"
    "name"
    "os"
    "path"
    "port"
    "process"
    "protocol"
    "region"
    "role"
    "score"
    "sev"
    "user_type"
    "username"
    "vend"
Refactor two methods to be able to traverse all nested keys.
* `fetch_values_by_datatype`
* `match_types`
* Use CEF standard to define normalized types.
* The bug was introduced when the return of method match_types mixed strings and lists, e.g. ['key1', ['key2', 'subkey2']].
* The solution is to enforce the return to be [list1, list2, list3, ...]
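
For illustration, the shape change described in that commit looks like this (with hypothetical keys):

# Before the fix: match_types could return a mix of bare keys and key paths
buggy = ['key1', ['key2', 'subkey2']]

# After the fix: every entry is a key-path list, even for top-level keys
fixed = [['key1'], ['key2', 'subkey2']]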
@chunyong-lin force-pushed the new_data_normalization branch from d6329b9 to f97a6a7 on September 6, 2017 at 23:52
@coveralls

Coverage Status

Changes unknown when pulling d2f2ebf on new_data_normalization into master.

@coveralls

Coverage Status

Coverage increased (+0.06%) to 95.332% when pulling d6329b9 on new_data_normalization into 8942c7d on master.

@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling f97a6a7 on new_data_normalization into a7f5d4f on master.

@ryandeivert (Contributor) left a comment

Hi @chunyong-lin, a few small comments I would like you to address before merging, but I am stamping!!

Congrats on your first big contribution - this is going to be HUGE!

return results

@classmethod
def update(cls, results, parent_key, nested_results):
Contributor: Can you add a description of the args to this function's docstring?


def test_fetch_values_by_datatype():
"""Helpers - Fetch values from a record by normalized type"""
rec = {
Contributor: Can we remove the unicode (u'..) prefix from these strings, or are they required for your tests? They typically aren't necessary unless you explicitly need the strings as unicode. Please also remove from your test on line 125 below if you can.

results = dict()
for key, val in record.iteritems():
if isinstance(val, dict):
nested_results = cls.match_types_helper(val, normalized_types, datatypes)
Contributor: Well done! 😸

in nested_results are original keys of normalized types.
nested_results (dict): A dict of normalized_types from nested record

Returns:
Contributor: @chunyong-lin this function doesn't actually 'return' anything. please remove this since it alters a dictionary by reference

Contributor (Author): Good catch! Thanks!

@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling 78a313b on new_data_normalization into a7f5d4f on master.

@chunyong-lin force-pushed the new_data_normalization branch from 78a313b to e6add10 on September 7, 2017 at 18:23
@coveralls

Coverage Status

Coverage increased (+0.002%) to 95.278% when pulling e6add10 on new_data_normalization into a7f5d4f on master.

@chunyong-lin merged commit 919f94d into master Sep 7, 2017
@chunyong-lin deleted the new_data_normalization branch September 7, 2017 18:50
@ryandeivert added this to the 1.5.0 milestone Sep 12, 2017