[rule processor] Support data normalization #285
Conversation
Force-pushed from fde052e to 70fc2e5
This is going to be huge!
helpers/base.py
Outdated
        (list) The values of normalized types
    """
    results = []
    if not datatype in rec['normalized_types'].keys():
Nit: the `.keys()` is superfluous and you can remove it
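To illustrate the nit: membership tests on a dict already check its keys, so the extra call buys nothing. A minimal sketch (the `rec` fixture here is made up for illustration):

```python
# Membership tests on a dict check its keys directly,
# so the trailing .keys() call is redundant.
rec = {'normalized_types': {'sourceAddress': ['sourceIPAddress']}}

datatype = 'sourceAddress'
assert datatype in rec['normalized_types']           # idiomatic
assert datatype in rec['normalized_types'].keys()    # same result, extra call
```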
@@ -0,0 +1,26 @@
"""Alert on matching IP address from aws access."""
from stream_alert.rule_processor.rules_engine import StreamRules
from helpers.base import fetch_values_by_datatype
Nit: To be alphabetical, the `helpers.base` import should come first
    for result in results:
        if result == '1.1.1.2':
            return True
    return False
🎉 This is a great example that helped me understand how the normalization can be used
      outputs=['aws-s3:sample-bucket',
               'pagerduty:sample-integration',
               'slack:sample-channel'])
def cloudtrail_aws_access_by_evil(rec):
We no longer put "sample" rules in the public repository, only real ones. I'll work with you offline to choose which ones to open source :)
@@ -88,6 +88,14 @@ def _validate_config(config):
            raise ConfigError(
                'List of \'logs\' is empty for entity: {}'.format(entity))

    # validate supported normalized types
    supported_logs = [
        'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
a hardcoded list? Doesn't seem right... why not pull from the config?
This code block validates whether the log sources defined in `types.json` are supported. I will remove this code since it looks redundant; we assume the log sources defined in `types.json` are supported.
        if not (datatypes and cls.validate_datatypes(normalized_types, datatypes)):
            return results

        for key, val in record.iteritems():  # pylint: disable=too-many-nested-blocks
hah, that pylint warning exists for a reason :)
Here's how to avoid having too much nesting: https://blog.rburchell.com/2010/11/coding-antipatterns-excessive-nesting.html
As a suggestion: to handle nested types better, instead of only accounting for 2 levels deep, you could call `match_types` from within this loop if `val` is a dict. You'd have to instantiate the `results` dict outside of this function and pass it to this function by reference to be updated, but it could easily be done. We can talk more about recursion offline if you'd like, but this is a perfect use case for it
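The recursive approach suggested here could look roughly like the sketch below. The name `match_types` and its arguments come from the PR; the body is a hypothetical illustration of the reviewer's idea, not the merged implementation (which also tracks nested key paths via a helper):

```python
def match_types(record, normalized_types, datatypes, results=None):
    """Collect keys in `record` that map to any of the requested datatypes.

    Hypothetical sketch of the recursion suggestion: instead of hardcoding
    two levels of nesting, recurse whenever a value is itself a dict. The
    `results` dict is shared across recursive calls (passed by reference).
    """
    if results is None:
        results = {}
    for key, val in record.items():
        if isinstance(val, dict):
            # Recurse into the nested record; `results` is updated in place
            match_types(val, normalized_types, datatypes, results)
            continue
        for datatype in datatypes:
            if key in normalized_types.get(datatype, []):
                results.setdefault(datatype, []).append(key)
    return results
```

For example, a key buried one level deep (`{'detail': {'sourceIPAddress': ...}}`) is found by the same code path as a top-level key, with no nesting limit.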
    @classmethod
    def validate_datatypes(cls, normalized_types, datatypes):
        """validate if datatypes valid in normalized_types for certain log
grammar, not sure what this is trying to say
helpers/base.py
Outdated
            return results

        for key in rec['normalized_types'][datatype]:
            # Normalized type may be in nested subkeys, we only support one level of
To confirm, this will work for cases where `envelope_keys` is used, correct?
Based on my live test (which has `envelope_keys` defined), it works for cases where `envelope_keys` is used.
Hey chunyong -- sorry for the wait but I added a few additional comments!
                types_result = cls.match_types(record,
                                               payload.normalized_types,
                                               rule.datatypes)
                record.update({'normalized_types': types_result})
You don't have to use `.update` here. `update` is useful if you have a preexisting dictionary that you want to 'append' to another dict, but since you're only adding one key/value item to the dictionary, just use:
`record['normalized_types'] = types_result`
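A quick sketch of the difference (the sample values here are made up): both forms produce the same dictionary for a single key, but direct assignment states the intent, while `update` pays off only when merging several keys at once.

```python
record = {'eventName': 'GetObject'}
types_result = {'sourceAddress': [['sourceIPAddress']]}

# Equivalent for a single key, but direct assignment is clearer:
record['normalized_types'] = types_result

# .update() earns its keep when merging several key/value pairs at once:
record.update({'normalized_types': types_result, 'other_key': 'value'})
```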
        (boolean): return true if all datatypes are defined
    """
    if not normalized_types:
        LOGGER.error('Normalized_types is empty.')
Please use a human readable string in the logger statement here ('Normalized types dictionary is empty')
        return False

    for datatype in datatypes:
        if not datatype in normalized_types.keys():
The `.keys()` call here is unnecessary. Please utilize `if not datatype in normalized_types`
    supported_logs = [
        'carbonblack', 'cloudwatch', 'cloudtrail', 'ghe', 'osquery', 'pan'
    ]
    for log_type in config['types'].keys():
The `.keys()` call here is not required
Force-pushed from aefe50d to dc3c0ee
@ryandeivert @mime-frame @austinbyers PTAL.
conf/types.json
Outdated
    "sourceAddress": ["sourceIPAddress"]
  },
  "ghe": {
    "processName": ["program"],
In GHE's case, `program` is not a real process, please remove this
conf/types.json
Outdated
  "ghe": {
    "processName": ["program"],
    "userName": ["current_user"],
    "destinationAddress": ["remote_address"],
I don't see an example GHE log with `remote_address` - let's chat offline
It is defined in the GHE schema.
conf/types.json
Outdated
    "transportProtocol": ["protocol"],
    "severity": ["severity"],
    "environmentIdentifier": ["envIdentifier"],
    "roleIdentifier": ["roleIdentifier"],
this is a custom decoration that only we do - remove
conf/types.json
Outdated
    "filePath": ["path"],
    "transportProtocol": ["protocol"],
    "severity": ["severity"],
    "environmentIdentifier": ["envIdentifier"],
this is a custom decoration that only we do - remove
conf/types.json
Outdated
    "command": ["cmdline", "command"],
    "message": ["message"],
    "sourceAddress": ["host", "source", "local_address", "address"],
    "destinationAddress": ["destination", "remote_address", "gateway"],
Make sure to account for values like `::1` when you write the validation classes
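The point about `::1` (the IPv6 loopback) is that address fields such as osquery's `local_address` are not guaranteed to hold IPv4 dotted quads. A minimal sketch of a validator that handles both families, using the standard library's `ipaddress` module; the function name is hypothetical, and the eventual validation classes may look quite different:

```python
import ipaddress


def is_valid_address(value):
    """Return True if `value` parses as an IPv4 or IPv6 address.

    Hypothetical validator illustrating the reviewer's point: a purely
    IPv4-shaped check (e.g. a regex for dotted quads) would wrongly
    reject IPv6 values like '::1'.
    """
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False
```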
conf/types.json
Outdated
    "destinationAddress": ["remote_address"],
    "sourcePort": ["port"]
  },
  "osquery": {
Add `"fileHash": ["md5", "sha1", "sha256"]` (see https://osquery.io/docs/tables/#hash)
Add `sourceUserId`, have the array contain `uid` (see https://osquery.io/docs/tables/#users)
Add `receiptTime`, have the array contain `unixTime` (all osquery logs have this field; it denotes when the info was collected)
Add `fileSize`, have it contain `size`. Ex: https://osquery.io/docs/tables/#file_events
conf/types.json
Outdated
  },
  "osquery": {
    "userName": ["username", "user"],
    "filePath": ["path"],
Add `directory` to the `filePath` array (see https://osquery.io/docs/tables/#hash)
* Add a configuration file conf/types.json
* Add data normalization logic in rule processor
* Add a sample rule for integration test
* Add unit test cases
We defined 24 normalized types: "account", "agent", "cluster", "cmd", "domain", "event_name", "event_type", "hashmd5", "host", "ipv4", "msg", "name", "os", "path", "port", "process", "protocol", "region", "role", "score", "sev", "user_type", "username", "vend"
Refactor two methods to be able to traverse all nested keys:
* `fetch_values_by_datatype`
* `match_types`
* Use the CEF standard to define normalized types.
* The bug was introduced when the return value of `match_types` mixed strings and lists, e.g. ['key1', ['key2', 'subkey2']].
* The solution is to enforce the return value to be a list of lists: [list1, list2, list3, ...]
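The list-of-lists convention described in the changelog can be sketched as follows. `fetch_values_by_datatype` is the helper named in this PR, but this body is a hypothetical illustration, not the merged code: each entry under `rec['normalized_types'][datatype]` is a key path (always a list, never a bare string), so the lookup can walk arbitrarily nested keys uniformly.

```python
def fetch_values_by_datatype(rec, datatype):
    """Fetch values from a record by normalized type.

    Hypothetical sketch of the list-of-lists convention: every key path is
    a list such as ['sourceIPAddress'] or ['detail', 'remote_address'],
    which avoids the string/list mix-up described in the changelog.
    """
    results = []
    if datatype not in rec.get('normalized_types', {}):
        return results
    for key_path in rec['normalized_types'][datatype]:
        value = rec
        for key in key_path:  # walk nested keys one level at a time
            value = value[key]
        results.append(value)
    return results
```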
Force-pushed from d6329b9 to f97a6a7
Changes Unknown when pulling d2f2ebf on new_data_normalization into ** on master**.
Hi @chunyong-lin a few small comments I would like you to address before merging, but I am stamping!!
Congrats on your first big contribution - this is going to be HUGE!
        return results

    @classmethod
    def update(cls, results, parent_key, nested_results):
Can you add a description of the args to this function's docstring?
def test_fetch_values_by_datatype():
    """Helpers - Fetch values from a record by normalized type"""
    rec = {
Can we remove the unicode (`u'..`) prefix from these strings, or are they required for your tests? They typically aren't necessary unless you explicitly need the strings as unicode. Please also remove from your test on line 125 below if you can.
        results = dict()
        for key, val in record.iteritems():
            if isinstance(val, dict):
                nested_results = cls.match_types_helper(val, normalized_types, datatypes)
Well done! 😸
                in nested_results are original keys of normalized types.
            nested_results (dict): A dict of normalized_types from nested record

        Returns:
@chunyong-lin this function doesn't actually 'return' anything. please remove this since it alters a dictionary by reference
Good catch! Thanks!
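The docstring point above, mutating a dict by reference and returning nothing, can be sketched like this. The signature mirrors the `update(cls, results, parent_key, nested_results)` method quoted earlier, but the body is a hypothetical illustration:

```python
def update(results, parent_key, nested_results):
    """Merge normalized-type matches from a nested record into `results`.

    Hypothetical sketch: `results` is mutated in place (each nested key
    path gets prefixed with `parent_key`), and nothing is returned, so the
    docstring should not document a Returns section.
    """
    for datatype, key_paths in nested_results.items():
        for key_path in key_paths:
            results.setdefault(datatype, []).append([parent_key] + key_path)
```

Because the caller holds a reference to the same dict, the changes are visible without any return value.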
Force-pushed from 78a313b to e6add10
to @ryandeivert @mime-frame
cc @austinbyers
size: medium
contributes to: #105
Background
Currently, logs from different log sources (CarbonBlack, Osquery etc) can capture the same types of information, but using different schemas or different key names. Thus, it requires writing a rule for every log source and referencing the specific key-names used.
Example:
Using the example above, you would need to write a rule for each log source if you want to alert on any suspicious use of `wget`. The data normalization feature lets you write one rule against all relevant logs: applying data normalization, you can write a single rule to alert on `wget` usage across the logs shown above.
We are using the CEF (Common Event Format) standard to define normalized types. Please refer to the CEF documentation (PDF) if you want to define your custom normalized types.
Changes
This configuration file defines normalized types for each log source. This example shows the normalized types defined for logs from CloudWatch.
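The actual `conf/types.json` content is not reproduced on this page, but its shape can be sketched from the snippets quoted in the review comments above: each log source maps normalized type names to the raw key names that carry that information. Shown here as the equivalent Python dict; the exact entries in the merged file may differ.

```python
# Hypothetical conf/types.json fragment (as a Python dict). Entries are
# modeled on snippets quoted elsewhere in this review, not the merged file.
TYPES_CONFIG = {
    'cloudtrail': {
        'sourceAddress': ['sourceIPAddress'],
    },
    'osquery': {
        'userName': ['username', 'user'],
        'filePath': ['path'],
        'command': ['cmdline', 'command'],
    },
}
```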
Testing
* `sourceAddress` was working correctly
* `eventType` was working correctly
* `userName` was working correctly
Upcoming
* `ipv4`, `domain`, `cmd`, etc.