[parsers] adding check for envelope_keys and validating if they exist #147
to: @airbnb/streamalert-maintainers
cc: @jacknagz, @mime-frame
Explanation of change
This is v1 of optimizations to the json parser. I'm looking into more changes, but this is a fairly easy one with great improvements in speed. The change is as follows:

The parser now checks whether `envelope_keys` are declared, and if they are, it validates that they exist before continuing to parse. If they are not declared, it still attempts record extraction via `json_path` as usual. Note that `json_path` is used in conjunction with `envelope_keys` (meaning `envelope_keys` isn't always present, but typically is). In instances where `envelope_keys` is not declared (the case where no important information is needed from the outer-level keys), this code will not have any performance improvement. However, as noted, that's not usually the case. A rough sketch of the idea follows.
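As a minimal sketch (not the actual diff in this PR), the precheck amounts to a cheap membership test on the record's top-level keys before any `jsonpath_rw` work happens; the function name and schema fields below are assumptions for illustration:

```python
from jsonpath_rw import parse


def parse_json_records(record, schema):
    """Illustrative precheck: skip expensive JSONPath extraction when the
    declared envelope_keys are missing from the record.

    `schema` is assumed to be a dict that may contain 'envelope_keys' and
    'json_path' entries (names are illustrative, not exact StreamAlert config).
    """
    envelope_keys = schema.get('envelope_keys')
    json_path = schema.get('json_path')

    # If envelope_keys are declared, make sure they all exist in the record
    # before paying the cost of a JSONPath parse/find. A plain set membership
    # check is far cheaper than invoking jsonpath_rw.
    if envelope_keys and not set(envelope_keys).issubset(record):
        return []

    if not json_path:
        return [record]

    # Fall through to the usual JSONPath extraction.
    expression = parse(json_path)
    return [match.value for match in expression.find(record)]
```

The savings come entirely from short-circuiting before `parse()`/`find()` run; when `envelope_keys` isn't declared there is nothing to short-circuit on, which is why there's no improvement in that case.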
Some Background
We were seeing a performance hit with certain logs coming from Carbon Black that were causing timeouts in Lambda (even with the duration set to the maximum of 5 minutes and a bump in memory/CPU).
After some digging through CloudWatch logs, I was able to zero in on one of the S3 files that was causing a timeout.
This led me to temporarily modify (read: hack) SA to support testing a file locally on disk that I had downloaded from S3.
In order to accurately gauge where the slowdown was occurring, I used Python's `cProfile` module like so (sorted by cumulative time):

`python -m cProfile -s cumtime stream_alert_cli.py lambda test --processor rule`

For those unfamiliar, `cProfile` provides roughly the same sort of insight into a Python program that Xcode's Time Profiler (in Instruments) provides for C/Objective-C/etc compiled applications. It outputs the time the program spent in functions, and the per-call time for said functions (and some other useful info). Technically, Time Profiler could probably also be used for Python programs, but the call stack is much more difficult to sort through, etc.

This is the head of the output from `cProfile`, sorted by cumulative time (the full output is omitted here). In that output, everything below the following line is calls into external packages:
67860 0.350 0.000 133.425 0.002 parsers.py:151(_parse_records)
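For reference, the same narrowing can be done programmatically with the standard-library `pstats` module instead of scanning the full dump by hand; the function being profiled below is a made-up stand-in:

```python
import cProfile
import pstats


def slow_path():
    """Hypothetical stand-in for the code path under investigation."""
    total = 0
    for i in range(100000):
        total += i * i
    return total


# Collect a profile of one call and write the raw stats to disk.
cProfile.runctx('slow_path()', globals(), locals(), 'out.prof')

stats = pstats.Stats('out.prof')

# Same ordering as `-s cumtime` on the command line; show the 20 hottest rows.
stats.sort_stats('cumulative').print_stats(20)

# Restrict the report to frames matching a pattern -- in the real run this
# would be 'parsers.py' to zero in on _parse_records.
stats.print_stats('slow_path')
```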
After looking in the `_parse_records` function in the `rule_processor/parsers.py` file, it was obvious what package we were calling that was causing this. Some observations:

- The package in question is `jsonpath_rw`, and it turns out there is a `parser.py` file in this library.
- That `parser.py` calls into a `yacc.py` file, where our program spent 119s of the total 149s.
- `yacc.py` is a part of the `ply` python package, which is a dependency of `jsonpath_rw`.
- The `lr_parse_table` method of `ply`'s `yacc.py` will make your head spin. The for loop in there is unlike anything I've ever seen before.
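To make those `yacc.py` numbers concrete, here is a small benchmark (not code from this PR) illustrating that `jsonpath_rw.parse()` itself is the expensive step, since each call runs the library's `parser.py` and, underneath it, `ply`; the expression and record are made up:

```python
import timeit

from jsonpath_rw import parse

EXPRESSION = 'records[*]'  # example path, not the actual Carbon Black schema
RECORD = {'records': [{'id': i} for i in range(10)]}


def parse_every_time():
    """Re-parse the JSONPath expression for every record (the slow pattern)."""
    expr = parse(EXPRESSION)
    return [match.value for match in expr.find(RECORD)]


COMPILED = parse(EXPRESSION)  # parse once up front


def reuse_compiled():
    """Only call find() per record, reusing the compiled expression."""
    return [match.value for match in COMPILED.find(RECORD)]


if __name__ == '__main__':
    print('parse per record :', timeit.timeit(parse_every_time, number=1000))
    print('parse once, reuse:', timeit.timeit(reuse_compiled, number=1000))
```

Skipping the call entirely for records that can't match (the `envelope_keys` precheck above) or otherwise keeping it off the per-record path avoids that cost.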
Upcoming Work

- Looking at removing the dependency on `jsonpath_rw` completely.

Current Outcome
This takes processing of a problematic log down from:
(screenshot of the original timing omitted)
to:
(screenshot of the improved timing omitted)