
Faster, collision-free coverage instrumentation #171

Merged

Conversation

jon-bell
Contributor

While implementing novel guidance improvements to JQF + Zest (to be described in our soon-to-be-published CONFETTI paper at ICSE), we found that the default coverage implementation that JQF provides was unusually slow. From profiling, we found that it uses the Janala instrumentation framework, which offers a generic mechanism for recording and inspecting various program events; this flexibility imposes a significant performance burden. We also hit cases where our more targeted guidance would fail to make progress on particular branches: newly covered branches weren't detected as covered, because their probes collided with those of other branches.

This PR includes a complete coverage implementation that is both faster than the one currently provided by JQF and collision-free. It is enabled by setting the flag -DuseFastNonCollidingCoverageInstrumentation=true. I had to make several changes to existing code to enable this extensibility - most notably introducing the ICoverage interface, and refactoring uses of Java collections with boxed primitives (which box and unbox on every branch execution, and are therefore slow) to use Eclipse primitive collections instead.
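The boxing issue can be illustrated with a stdlib-only sketch. Note that BoxedCounter and PrimitiveCounter are hypothetical stand-ins written for this illustration; the PR itself uses Eclipse primitive collections (e.g. IntIntHashMap), which give the same no-boxing property for sparse probe ids:

```java
import java.util.HashMap;
import java.util.Map;

// Boxed counters: every hit() boxes Integer keys and values, and the
// merge() call allocates on the hot path of every branch execution.
class BoxedCounter {
    final Map<Integer, Integer> counts = new HashMap<>();
    void hit(int probe) { counts.merge(probe, 1, Integer::sum); }
    int get(int probe) { return counts.getOrDefault(probe, 0); }
}

// Primitive counters: a dense int[] indexed by probe id, allocation-free
// per hit. Eclipse's primitive maps offer this without a fixed bound.
class PrimitiveCounter {
    final int[] counts;
    PrimitiveCounter(int numProbes) { counts = new int[numProbes]; }
    void hit(int probe) { counts[probe]++; }
    int get(int probe) { return counts[probe]; }
}
```

Both produce identical counts; the difference is purely in per-hit allocation and boxing cost on the instrumentation hot path.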

I conducted a 24 hour x 20 trial evaluation comparing the performance of this coverage implementation to the JQF implementation. In terms of execution speed, the difference is quite dramatic, with the fast-collision-free-coverage implementation executing roughly 7-10x as many inputs in the same 24 hour period.

Note that because there are no probe collisions, the branch-probes-vs-time graph gives fast-collision-free-coverage a built-in advantage (there are simply more probes that can be covered), so I have also included JaCoCo coverage results obtained by reproducing the entire saved corpus of each fuzzing run. JaCoCo branch coverage was the same or better in all cases for fast-collision-free-coverage.
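The collision problem itself can be seen in a minimal sketch of AFL-style edge hashing (simplified for illustration; real AFL additionally right-shifts the previous location, and JQF's scheme differs in detail):

```java
// AFL-style coverage: edge identity is hashed into a fixed-size bitmap,
// so distinct edges can land on the same probe and mask each other.
class EdgeHash {
    static final int MAP_SIZE = 1 << 16;
    static int probeIndex(int prevBranchId, int curBranchId) {
        return (prevBranchId ^ curBranchId) & (MAP_SIZE - 1);
    }
}
```

With plain XOR, the edge A→B and the reverse edge B→A collide, and any two branch ids that differ only above the low 16 bits share a slot. A collision-free scheme instead assigns every real branch its own probe id, so a newly covered branch is always visible to the guidance.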

Note that on Maven, the revision of fast-collision-free-coverage that I executed did not include the recent patch to increase the generator coverage - it underperforms the normal coverage implementation only for that reason.

I do not have an automated pipeline to deduplicate the failures and determine the intersection and unique crashes found in each campaign, but the failures are all linked from the report.

I am not sure what the downside, if any, is to making this the default coverage implementation - I suspect that some changes may be needed to other Guidance implementations to be sure that they still work correctly, beyond just running the test suite. However, I think having the option to use this instrumentation could be a big win for future development that adds more nuanced guidance.

If you have any suggestions, I would be happy to incorporate them and regenerate the evaluation. Thank you for building and maintaining this valuable infrastructure!

This commit introduces new coverage collection instrumentation that avoids the Janala trace framework and instead adds coverage probe calls directly, without requiring additional object allocations and without instrumenting instructions that have no branch probes. This coverage implementation is also collision-free, and can be enabled if desired using the flag -DuseFastNonCollidingCoverageInstrumentation=true.
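For reference, enabling the new instrumentation in a typical jqf-maven-plugin run might look like the following configuration fragment. The class and method names are placeholders; only the -DuseFastNonCollidingCoverageInstrumentation=true flag comes from this PR:

```shell
# Hypothetical invocation: com.example.MyFuzzTest / testSomething are placeholders
mvn jqf:fuzz -Dclass=com.example.MyFuzzTest -Dmethod=testSomething \
    -DuseFastNonCollidingCoverageInstrumentation=true
```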
@rohanpadhye
Owner

Woo-hoo! This looks awesome. I can certainly see the issue with the current coverage instrumentation and how primitive collections, etc. would improve performance. The evaluation results are very promising. Thanks so much for the comprehensive set of results.

I can't immediately think of any downsides/risks to switching the coverage data structures either (beyond some downstream disruptions for subclasses of various Guidances). Adding a dependency on Eclipse collections is fine for me. I'm just going to go through the code changes to estimate the impact, if any. I'll also page in some of my grad students who have done similar tinkering in other contexts. Hope to merge this in the next few weeks!

@rohanpadhye rohanpadhye self-assigned this Dec 30, 2021
Collaborator

@aoli-al left a comment

Hi Jonathan,

I'm a PhD student working with @rohanpadhye on JQF. Thank you for making these changes - the experimental results look very promising!

I left a few comments on the implementation; the rest LGTM. I do notice that the new implementation will affect the EI guidance, since it relies on the instrumentation to compute EI, but I'm happy to change the EI guidance to support the new FastCoverageSnoop.

@jon-bell
Contributor Author

Cool! I wasn't sure what the expected results were for EI. I am happy to help with those changes, and would also be happy to run it on my evaluation cluster.

@jon-bell
Contributor Author

jon-bell commented Jan 1, 2022

Here's the updated 24 hour evaluation - 71c1cb is the revision that mirrors the fix in this branch's 65713f7, including the maven generator fix (it differs only in that it has CI evaluation scripts): https://ci.in.ripley.cloud/logs/public/jon-bell/JQF/71c1cbed02756140b3b663846caf83a01be58baa/Gold%20evaluation%20-%2024%20hours,%2020%20trials/1641385165/1/site/

Reducing coverage collisions has little-to-no impact on targets where the search space is already quite saturated due to the limited scope of the generator (e.g. ant, bcel, maven), but has a more noticeable impact on closure and rhino.

@rohanpadhye
Owner

rohanpadhye commented Jan 9, 2022

Thanks for the updated evaluation! Looks great. We were looking at the results and wondering why there's so much of a difference between branch probes covered (e.g. ~880 for Ant), JaCoCo branches covered (e.g. ~21,000 for Ant), and our expectation of how many branches should be covered based on previous evaluations of Zest. So @aoli-al dug around your repo to find the plotting script and found the relevant ggplot, which seems to source its data from a variable cov that gets its value from paths_total. Not sure if this is intentional, but paths_total in JQF's plot_data CSV file represents just the size of the input corpus saved so far --- the CSV format itself is derived from AFL's own plot_data format, which uses AFL's confusing terminology (which, thankfully, is being revamped by AFL++). This metric is not equivalent to branch coverage, though there should be some correlation.

tl;dr -- (1) Can you confirm or double check if "Branch probes over time" is actually plotting branch probes? (2) Regardless of whether this is a bug or not, the JaCoCo branch coverage and fuzzing throughput definitely improve, so we are still on track to merge this PR.

@jon-bell
Contributor Author

jon-bell commented Jan 9, 2022

Woof. Thanks for the debugging help here - I have separately been trying to figure out what is going on with these plots, because they were showing JQF outperforming CONFETTI on branch probes, despite the fact that our earlier experiments (not using these scripts) showed the opposite, as did the JaCoCo results. I just added two new columns, all_cov and valid_cov, to the plot_data file. The earlier experiments had used our fork of JQF 1.1, to which we had also added these columns, and hence the graphing scripts that we had used before relied on the all_cov column.

Unless you already have scripts to generate coverage-over-time graphs by reproducing the results from the corpus (this seems tricky to do based on what is logged), I'll re-run the entire 24 hour evaluations for both mainline JQF and this feature to collect the new plot_data files, then update the scripts to generate the graphs from this column.

@rohanpadhye
Owner

Yeah, I don't think I've run a coverage-over-time plot using plot_data in a while (and can't find such a script in the repo); but my guess is that the last thing we used would have taken the map_size/valid_cov columns and multiplied them by MAP_SIZE. This would of course be subject to undercounting branches because of hash collisions.
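A hypothetical reconstruction of that back-of-the-envelope estimate, assuming the AFL convention that plot_data stores map_size as a percentage string (e.g. "12.34%") and a 2^16-slot bitmap:

```java
// Rough branch-count estimate from plot_data's map_size column.
// Assumption: percentage string input and an AFL-style 2^16 bitmap.
// Hash collisions make this an undercount, as noted above.
class CoverageEstimate {
    static final int MAP_SIZE = 1 << 16;
    static long branchesCovered(String mapSizePercent) {
        double fraction = Double.parseDouble(mapSizePercent.replace("%", "")) / 100.0;
        return Math.round(fraction * MAP_SIZE);
    }
}
```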

The more precise way we've done this is to use JaCoCo to accumulate coverage of each saved input after the fact, and to use the plot_data to figure out when the corresponding input was saved to make an over-time plot. I'll share the script if I can find it.

@rohanpadhye
Owner

Your approach of adding two new columns is certainly easier + cleaner, and I'd be happy to include it.

… the failure; Add a branch probe before each call site
@jon-bell
Contributor Author

👋 So, once I added the correct logging to record branch probes in plot_data, I of course got to spend all week debugging and comparing to figure out why my implementation was hitting fewer probes. I noticed that JQF adds a CallEvent, which, as far as I can understand, is interpreted as a unique branch probe, where each call site receives its own branch probe. I've now added that to my branch - can you confirm that this is the intended behavior?

The fuzzing report for the most recent revision that I've pushed has all of the details.

FYI, I also found that there are a lot more "unique failures" being found now; in particular, Closure might find 18k failures in a single 24 hour run, up from 4k. I have not looked into the failures to try to de-duplicate them and see which, if any, are new, but I noticed that this may be related to growing memory consumption - I added some small optimizations to reduce heap usage when collecting failures, but still saw out-of-memory exceptions at 4GB heap sizes (the default heap assigned by OpenJDK on my 16GB RAM VMs). I am running with 12GB heaps now without crashes, but this might be worth investigating in the future to reduce the memory bloat.

@rohanpadhye rohanpadhye added this to the 2.0 milestone Jan 17, 2022
@rohanpadhye
Owner

Good catch! Thanks for the memory optimization for failure tracking.

I'm wondering why you chose to move the call tracking logic out of visitCode and into visitMethodInsn.

Let me give some background on the CallEvent tracking for context. Instrumenting calls in JQF has had two purposes: (1) due to dynamic dispatch, virtual method calls are a form of conditional branching that should be tracked by any coverage-guided fuzzer; (2) for the specific case of EI --- which is quite undocumented, so I'm not expecting most users to know about it --- we want to be able to match calls and returns so that we always know the current state of the call stack.

By instrumenting the start of a method body for METHOD_BEGIN, we both (1) get the actual invoked target as a result of dynamic dispatch and (2) we only emit CallEvents when we are guaranteed to emit corresponding ReturnEvents, because we know the method body is instrumented. An asymmetric handling of calls/returns can be problematic if we generate a CallEvent for a call to a function that is in an uninstrumented class, and we don't see the corresponding return. There's also a fair bit of logic to handle exceptional returns and so on, which I'm sure you've also seen and dealt with.

Anyway, most of these issues should not concern the fast coverage instrumentation approach that your PR is going for, since there is no call stack tracking. So, I don't see a problem in instrumenting call sites instead. I'm just curious if you had a specific reason for doing so that I haven't already thought of.

@jon-bell
Contributor Author

I did not see any logic that combined the call site with the result of the dynamic dispatch, so I did not attempt to implement that. I also might not have understood the code correctly - but if I did understand it right, I'd guess this is a bug rather than the expected behavior. My understanding of the CallEvent handling is, roughly:

  1. Add INVOKE_****** at a method dispatch. Create a unique IID for the call site.
  2. Add METHOD_BEGIN at the start of each method
  3. At runtime, hitting the CallEvent sets the invokeTarget on ThreadTracer, so the IID of that invokeTarget is the unique IID created for the call site
  4. Then, when the METHOD_BEGIN fires, ThreadTracer emits a CallEvent with the IID of the call site
  5. In Coverage, process the callEvent by incrementing the branch probe for the call site
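The flow above can be modeled in miniature. This is a hypothetical sketch of the described event sequence, not JQF's actual ThreadTracer/Coverage classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model: the INVOKE* hook records the call-site IID; the next
// METHOD_BEGIN attributes a CallEvent (and hence a probe hit) to that site.
class TracerModel {
    Integer pendingCallSiteIid = null;               // set in step 3
    final Map<Integer, Integer> probeCounts = new HashMap<>();

    void onInvoke(int callSiteIid) {                 // steps 1 and 3
        pendingCallSiteIid = callSiteIid;
    }

    void onMethodBegin() {                           // steps 2, 4, and 5
        if (pendingCallSiteIid != null) {
            probeCounts.merge(pendingCallSiteIid, 1, Integer::sum);
            pendingCallSiteIid = null;
        }
    }
}
```

In this toy model, an uninstrumented callee never fires onMethodBegin and would leave a stale pending IID behind; JQF's real ThreadTracer has additional logic for that case and for exceptional returns.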

I moved the call tracking logic out of visitCode and into visitMethodInsn because it seemed to provide the closest analog to this existing behavior, although as you point out, it also creates branch probes for call sites that end up resolving to non-instrumented code.

While it is indeed technically an edge in the program's control flow graph, I think the question of exactly which edges to include in edge coverage is very much open, with little concrete evidence and much speculation. For example: if an instruction could throw an exception and it doesn't, it might be interesting for a fuzzer to know that (similarly, if an exception is thrown, it might be interesting to know the edge that was traversed to reach the exception-handling target). My experience with Java coverage tools thus far (e.g. JaCoCo, Cobertura, PIT) is that there is generally a preference for more performant coverage implementations at the risk of missing some of those edges. Tracking edge coverage between callers and callees is a comparatively very expensive operation, because it requires maintaining sideband information about call stacks, and there is no efficient ThreadLocal in the JVM.
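The cost argument can be sketched concretely: caller→callee edge coverage needs per-thread call-stack state touched on every call and return, before any edge probe can even be computed. A hypothetical stdlib-only illustration (not JQF code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sideband call-stack tracking: every call/return pays for a
// ThreadLocal lookup plus a stack push or pop.
class CallStackTracker {
    private static final ThreadLocal<Deque<Integer>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void enter(int methodId) { STACK.get().push(methodId); }
    static void exit() { STACK.get().pop(); }

    static int currentCaller() {
        Deque<Integer> s = STACK.get();
        return s.isEmpty() ? -1 : s.peek();
    }
}
```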

@rohanpadhye
Owner

Hmm, you are right that the current Coverage data structure does not actually use the target method info. It looks like there is infrastructure in place to get it (CallEvent.invokedMethod), but it does not seem to be used for coverage --- perhaps it should be used similarly to the arm of a branch? And of course, the whole discussion of coverage granularity is definitely speculative. You also made me realize that ZestGuidance does not track coverage for exceptional control flow --- now I'm wondering whether this should actually be there, because we probably don't get any signal for those cases 🤔.

I like the general approach of performance over accuracy in this PR. In the long run (no pun intended), that will probably matter more. Thanks also for the updated experiments with the call site probes and memory optimization.

I'm going to start a 2.0-SNAPSHOT soon and merge this PR into that stream, along with a bunch of other non-trivial changes that are in the pipeline.

@jon-bell
Contributor Author

Sounds great! I'm not sure if you have a mailing list of JQF users, but it is probably worth pointing this out to anyone who is using Zest to run 5-10 minute fuzzing campaigns in CI. It seems like coverage in a longer campaign is not particularly impacted by this change due to saturation of the code reachable from the generators - but the performance improvement makes a much more noticeable difference at the start of a campaign.

I'm not sure what other kinds of refactoring you are planning, but if you are interested in continuing to explore coverage probe placement, it will probably be worthwhile to consider migrating entirely from the AFL-style fixed-size probe array to the Eclipse IntHashMap used by this branch - as you add more probes, collisions become increasingly problematic, and avoiding them is not particularly expensive.

@rohanpadhye rohanpadhye merged commit 4ecee4b into rohanpadhye:master Feb 23, 2022
@rohanpadhye
Owner

Merged this to the working 2.0-SNAPSHOT. Thanks again for the contribution, @jon-bell !
