
[parsers] adding check for envelope_keys and validating if they exist #147

Merged
2 commits merged into master from ryandeivert-json-optimization on May 16, 2017

Conversation

ryandeivert (Contributor)

to: @airbnb/streamalert-maintainers
cc: @jacknagz, @mime-frame

Explanation of change

This is v1 of optimizations to the JSON parser. I'm looking into more changes, but this is a fairly easy one with a large improvement in speed. The change is as follows:

  • Adds a check for envelope_keys: if they are declared, the parser validates that they exist before continuing to parse (see the sketch after this list). If they are not declared, it will still attempt record extraction via json_path as usual.
  • In most cases, json_path is used in conjunction with envelope_keys (meaning envelope_keys isn't always present, but typically is). In instances where envelope_keys is not declared (i.e. no important information is needed from the outer-level keys), this code will not yield any performance improvement. However, like I said, that's not usually the case.
  • This also includes some other very simple nit-level code updates.
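
A minimal sketch of the envelope_keys short-circuit (json_payload and schema_options are hypothetical stand-ins for the StreamAlert schema config; this is not the exact diff in this PR):

    import jsonpath_rw

    def extract_records(json_payload, schema_options):
        """Extract records, bailing out early when declared envelope_keys
        are missing from the payload. Hypothetical sketch, not the PR diff.
        """
        envelope_keys = schema_options.get('envelope_keys')
        # If envelope_keys are declared but any is absent at the top level,
        # skip the (expensive) jsonpath parse entirely
        if envelope_keys and not all(key in json_payload for key in envelope_keys):
            return None

        json_path = schema_options.get('json_path')
        if not json_path:
            return [json_payload]

        expression = jsonpath_rw.parse(json_path)
        return [match.value for match in expression.find(json_payload)]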

Some Background

We were seeing a performance hit with certain logs coming from Carbon Black that were causing timeouts in Lambda (even with the duration set to the maximum of 5 minutes and a bump in memory/CPU).

  • After some digging through CloudWatch Logs, I was able to zero in on one of the S3 files that was causing a timeout.

  • This led me to temporarily modify (read: hack) StreamAlert to handle testing a file locally on disk that I had downloaded from S3.

  • In order to accurately gauge where the slowdown was occurring, I used Python's cProfile module like so (sorted by cumulative time):

    python -m cProfile -s cumtime stream_alert_cli.py lambda test --processor rule

For those unfamiliar, cProfile provides roughly the same sort of insight into a Python program that Xcode's Time Profiler (in Instruments) provides for compiled C/Objective-C/etc. applications. It outputs the total time the program spent in each function and the per-call time for those functions (plus some other useful info). Technically, Time Profiler could probably also be used for Python programs, but the call stack is much more difficult to sort through.
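
To keep a profile around for further slicing, the same data can be captured to a file and explored with the standard-library pstats module (a generic sketch; main() here is just a stand-in for whatever entry point is being profiled):

    import cProfile
    import pstats

    def main():
        pass  # stand-in for the real entry point being profiled

    # Capture a profile to disk instead of printing it all at once
    # (equivalent to: python -m cProfile -o rule_processor.prof <script>)
    cProfile.run('main()', 'rule_processor.prof')

    # Load the stats, sort by cumulative time, and show the 20 hottest entries
    stats = pstats.Stats('rule_processor.prof')
    stats.sort_stats('cumulative').print_stats(20)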

This is the head of the output from cProfile, sorted by cumulative time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  149.214  149.214 stream_alert_cli.py:26(<module>)
        1    0.000    0.000  148.411  148.411 stream_alert_cli.py:139(main)
        1    0.000    0.000  148.408  148.408 runner.py:42(cli_runner)
        1    0.000    0.000  148.408  148.408 runner.py:66(lambda_runner)
        1    0.000    0.000  148.408  148.408 test.py:384(stream_alert_test)
        1    0.000    0.000  148.408  148.408 test.py:99(test_rule)
        1    0.000    0.000  148.344  148.344 handler.py:34(run)
        1    0.054    0.054  148.342  148.342 handler.py:83(_s3_process)
    18294    0.059    0.000  147.928    0.008 handler.py:107(_process_alerts)
    18294    0.061    0.000  146.641    0.008 classifier.py:188(classify_record)
    18294    0.150    0.000  146.549    0.008 classifier.py:298(_parse)
    18294    0.521    0.000  145.758    0.008 classifier.py:252(_process_log_schemas)
    67860    0.232    0.000  136.539    0.002 parsers.py:226(parse)
    67860    0.350    0.000  133.425    0.002 parsers.py:151(_parse_records)
    11328    0.023    0.000  131.926    0.012 parser.py:13(parse)
    11328    0.029    0.000  131.890    0.012 parser.py:30(parse)
    11328    0.847    0.000  131.842    0.012 parser.py:34(parse_token_stream)
    11328    1.056    0.000  119.754    0.011 yacc.py:3214(yacc)
    11328    0.179    0.000   98.014    0.009 yacc.py:2100(__init__)
    11328   18.318    0.002   87.749    0.008 yacc.py:2533(lr_parse_table)
 11146752   20.751    0.000   34.268    0.000 yacc.py:2163(lr0_goto)

In the output above, everything below the following line consists of calls into external packages:

    67860    0.350    0.000  133.425    0.002 parsers.py:151(_parse_records)

After looking at the _parse_records function in the rule_processor/parsers.py file, it was obvious which package we were calling was causing this. Some observations:

  • The only possible culprit was jsonpath_rw, and it turns out there is a parser.py file in this library.
  • The real problem seems to be the yacc.py file, where our program spent 119s of the total 149s.
  • I was able to verify that yacc.py is part of the ply Python package, which is a dependency of jsonpath_rw.
  • Looking at the code in the lr_parse_table method of ply's yacc.py will make your head spin; the for loop in there is unlike anything I've ever seen before. (The sketch after this list shows why that code is hit so often.)
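
Judging by the call counts above (yacc's __init__ runs once per parser.py parse call), each jsonpath_rw.parse() rebuilds its ply parser from scratch. Independent of this PR's envelope_keys check, a common mitigation is to compile each JSONPath expression once and reuse it; a hypothetical sketch:

    import jsonpath_rw

    # Cache of compiled JSONPath expressions, keyed by the raw path string.
    # jsonpath_rw.parse() is what invokes ply's yacc machinery, so compiling
    # once per unique path (instead of once per record) avoids the hot spot.
    _JSONPATH_CACHE = {}

    def compiled_jsonpath(path):
        """Return a compiled jsonpath_rw expression, compiling at most once."""
        if path not in _JSONPATH_CACHE:
            _JSONPATH_CACHE[path] = jsonpath_rw.parse(path)
        return _JSONPATH_CACHE[path]

    # Example with a made-up payload; only the first call pays the yacc cost
    payload = {'records': [{'id': 1}, {'id': 2}]}
    for _ in range(1000):
        values = [m.value for m in compiled_jsonpath('records[*]').find(payload)]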

Upcoming Work

  • If possible, remove our dependency on jsonpath_rw completely (one possible direction is sketched below).
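
For the shallow paths common in these schemas, plain dictionary traversal would avoid jsonpath_rw and ply entirely. A purely illustrative helper (it handles only simple keys and a trailing [*], not the full JSONPath grammar):

    def extract_simple_path(payload, path):
        """Walk a dotted path like 'logs.records[*]' using plain dict access.

        Hypothetical helper, not part of this PR.
        """
        values = [payload]
        for part in path.split('.'):
            wildcard = part.endswith('[*]')
            key = part[:-3] if wildcard else part
            next_values = []
            for value in values:
                child = value.get(key) if isinstance(value, dict) else None
                if child is None:
                    continue
                if wildcard and isinstance(child, list):
                    next_values.extend(child)
                else:
                    next_values.append(child)
            values = next_values
        return values

    # Equivalent to the JSONPath 'logs.records[*]'
    payload = {'logs': {'records': [{'id': 1}, {'id': 2}]}}
    print(extract_simple_path(payload, 'logs.records[*]'))  # [{'id': 1}, {'id': 2}]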

Current Outcome

This takes processing of a problematic log down from:

    real    1m50.695s
    user    1m45.687s
    sys     0m3.885s

to:

    real    0m13.464s
    user    0m12.806s
    sys     0m0.509s

@jacknagz (Contributor) left a comment:

🚢 it

@ryandeivert ryandeivert merged commit ee4c305 into master May 16, 2017
@austinbyers austinbyers deleted the ryandeivert-json-optimization branch June 24, 2017 00:48