
Enhance block coverage to correctly track coverage in presence of exceptions; Base mutant-test-pairs on blocks (not lines) #534

Merged: 4 commits, Aug 27, 2019

Conversation

@jon-bell (Contributor)

We found that PIT can erroneously report mutants as "survived" when they should actually be NO_COVERAGE (and should never even be executed). This happens because the initial block-based coverage collection assumes that no instruction throws an exception, so PIT will attempt to mutate instructions that are never actually covered but are assumed to be covered (based on the block definition). I've created a test that demonstrates this (VerifyBlockCoverageIT.shouldNotRunMutantsOnLinesUncoveredByExceptions).
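To make the failure mode concrete, here is a minimal sketch (hypothetical code, not taken from the PR's test suite): under the old block definition both statements below fall into one basic block, so the block's probe records the multiplication as covered even when size() throws and the return is never reached.

```java
import java.util.List;

// Hypothetical illustration of the bug: under the old block definition,
// the probe for this block marks both lines below as covered.
class CoverageExample {
    static int doubledSize(List<String> xs) {
        int size = xs.size(); // throws NullPointerException when xs is null
        return size * 2;      // unreached on the exceptional path, yet "covered"
    }

    public static void main(String[] args) {
        try {
            doubledSize(null); // only the exceptional path is ever exercised
        } catch (NullPointerException expected) {
            // a test like this "covers" the method but never reaches the
            // return, so a mutant in `size * 2` can never be killed and
            // should be reported as NO_COVERAGE, not SURVIVED
        }
    }
}
```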

We also noticed that PIT decides which test-mutant pairs to run based on line coverage rather than block coverage (even though block coverage is what is collected). I've created a test that demonstrates this too (VerifyBlockCoverageIT.shouldNotRunMutantsOnCoveredLineButNotCovered).

This PR resolves both of these issues by:

  1. Modifying the ControlFlowAnalyser to treat instructions that might throw an exception as ending a basic block (for coverage purposes).
  2. Refactoring the block counter used during mutation running to use the exact same notion of a block (the two implementations previously did not match, so the block index for the same instruction could differ between the coverage minion and the mutation minion).
  3. Modifying the test prioritiser to use the block index, rather than the line number, to determine where to insert a mutant.

I've also updated the many unit/integration tests that were tightly coupled to specific line/block numbers to match this new approach.

Anecdotally this does not seem to have a significant performance impact, but if you have performance concerns and a preferred way to measure coverage/mutation performance, I'm happy to optimise as necessary.

…sic block boundaries, consider instructions that might end a basic block due to throwing an exception (e.g. invoke a method which throws an exception, ending the block). When matching mutants with coverage, use this block information instead of line information. Refactor the block boundary detection for mutation minions to use the same code as the coverage minions for consistency.
@hcoles (Owner) commented Nov 14, 2018

@jon-bell Thanks for this, these changes will need a little thought.

I've not looked at the code, but from what you've said you have effectively reduced the size of a block to a single instruction, so pitest would be running against something roughly akin to statement coverage.

Block size is a trade-off. The larger the block, the fewer probes are required, reducing the overhead of calculating coverage. Smaller blocks increase the accuracy of the test targeting when exceptions are present, potentially reducing the cost of the mutation analysis stage.

As this is a trade-off, there isn't a single right answer. Large blocks will provide better performance for some codebases; smaller blocks will provide better performance for others.

When I first put this together (a long while back now), the "correct" answer to this trade-off was easy: my day job was to evolve a large legacy codebase, so whatever worked out fastest for that codebase was the right answer. The right answer now is less clear, as pitest has a wide and varied user base and I no longer work on that project.

Another factor that has to be weighed into the trade-off is complexity, both of the code and of the concepts: the current definition of a block is easy to understand and explain, and a definition where a block == a statement is also easy to understand. Anything in between is much less so.

Switching to targeting mutants by block rather than by line also needs a little investigation.

The last time I had a chance to play with the coverage implementation, the plan was to do this; as you can see from the code, it didn't happen. There's a horrible mix of lines and blocks, with maps and mess all over the place.

From memory, the reason for this was some nasty corner cases where targeting by block would result in a behaviour change: pitest uses the line number debug information to identify where the compiler has inlined code for finally blocks and try-with-resources. I think (but am not certain without finding time to delve into this again) that this would be affected if mutants were properly targeted by block.

So where does that leave us?

If possible, the three changes need to be separated so that one does not block the others. Improvements to tests ought to be easy to merge in. If there is no issue with finally blocks etc., then switching to block targeting should also be easy.

Although performance will be a concern for the block size change, a bigger factor will probably be any complexity it introduces, so I'll need to find time to take a look at a working version of these changes (currently the branch seems to be buggy: I get an index out of bounds exception if I run it against https://github.com/hcoles/truth).

Then we'd need to measure any performance impact.

Unfortunately Pitest doesn't currently have a proper performance test suite. We'd need to select a corpus of real projects to use as reference points. At least some of them need to be large with long running test suites.

@jon-bell (Contributor, Author) commented Nov 15, 2018

I've not looked at the code, but from what you've said you have effectively reduced the size of a block to a single instruction, so pitest would be running against something roughly akin to statement coverage.

Technically it's not a single instruction, but indeed there are now many more instructions that can end a block (any method invocation, division, checkcast, anewarray, etc).
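Roughly speaking, the block-ending predicate now looks like this (a simplified sketch using ASM's opcode constants, not the exact code from the patch):

```java
import org.objectweb.asm.Opcodes;

// Simplified sketch (not the exact patch code): opcodes treated as
// potentially throwing, and therefore as ending a coverage block.
final class MayEndBlock {
    static boolean mayThrow(int opcode) {
        switch (opcode) {
            case Opcodes.INVOKEVIRTUAL:
            case Opcodes.INVOKESPECIAL:
            case Opcodes.INVOKESTATIC:
            case Opcodes.INVOKEINTERFACE:
            case Opcodes.INVOKEDYNAMIC: // any method call may throw
            case Opcodes.IDIV:
            case Opcodes.LDIV:
            case Opcodes.IREM:
            case Opcodes.LREM:          // integer division/remainder: ArithmeticException
            case Opcodes.CHECKCAST:     // ClassCastException
            case Opcodes.ANEWARRAY:
            case Opcodes.NEWARRAY:      // NegativeArraySizeException
                return true;
            default:
                return false;
        }
    }
}
```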

Block size is a trade-off. The larger the block, the fewer probes are required, reducing the overhead of calculating coverage. Smaller blocks increase the accuracy of the test targeting when exceptions are present, potentially reducing the cost of the mutation analysis stage.

Before getting into performance measurements on PIT, I'll mention that JaCoCo puts probes after method invocations (but not after the other instructions that I threw in); we found it to have ~35% overhead on average across 17 OSS projects that we studied.

Also, it would be very easy to make the block size configurable with a flag and let users choose.

The last time I had a chance to play with the coverage implementation, the plan was to do this; as you can see from the code, it didn't happen. There's a horrible mix of lines and blocks, with maps and mess all over the place.

I wouldn't say it was horrible :)

From memory, the reason for this was some nasty corner cases where targeting by block would result in a behaviour change: pitest uses the line number debug information to identify where the compiler has inlined code for finally blocks and try-with-resources. I think (but am not certain without finding time to delve into this again) that this would be affected if mutants were properly targeted by block.

As far as I can tell, the try/finally handling should still work the same (it still goes by line numbers, and we still correctly map from block to line number). If you can think of cases that aren't covered by existing tests, I can take a look.

Although performance will be a concern for the block size change, a bigger factor will probably be any complexity it introduces, so I'll need to find time to take a look at a working version of these changes (currently the branch seems to be buggy: I get an index out of bounds exception if I run it against https://github.com/hcoles/truth).

It was a stupid auto-complete-induced bug in my patch (getDescription vs getMethodDescriptor), now fixed. For what it's worth, on this project coverage collection takes 2 seconds either way. The same number of mutants were generated and survived, but fewer tests were run:
(this branch):

Generated 727 mutations Killed 637 (88%)
Ran 1489 tests (2.05 tests per mutation)

(master):

Generated 727 mutations Killed 637 (88%)
Ran 1561 tests (2.15 tests per mutation)

We'd need to select a corpus of real projects to use as reference points. At least some of them need to be large with long running test suites

I'm happy to set up and run this. Here are 33 projects that we've been poking at with PIT for other purposes; I can easily run both versions on all of them and report back (please suggest others too, if anything comes to mind). Each project's total test time (without PIT) is generally under 30-40 minutes.
achilles, assertj-core, cloudera.oryx, commons-collections, commons-configuration, commons-dbcp, commons-exec, commons-functor, commons-io, commons-jxpath, commons-net, commons-validator, cors-filter, cucumber-reporting, dropwizard, empire-db, exp4j, handlebars, handlebars.java, hector, httpcore, jimfs, jsoup, logback, ninja, okhttp, oryx, raml-java-parser, retrofit, tachyon, togglz, undertow, wro4j

@hcoles (Owner) commented Nov 17, 2018

I'm a lot more comfortable with the idea of ending a block on just a method invocation rather than on a subset of instructions.

This would still greatly reduce the size of a block, but the definition of a block would remain easy to understand. It should require very little change to implement and, I suspect, would provide 80% of the benefit in terms of targeting mutants (in the context of code executing as part of a unit test, I think the most common point at which exceptions are thrown is a method invocation).

I've just taken a look to remind myself what I believe the issue is with finally blocks.

Finally blocks are implemented by the compiler rather than the JVM. For code in a finally block, two copies of the bytecode are generated: one for normal execution and one for exceptional execution.

Pitest sees this as separately mutable code. If a test hits only the non-exception copy, the exception-path mutant will survive, and vice versa. The mutants are, however, displayed on the same line (as they come from the same source code). To match the behaviour of a source-code-level bug, if one mutant is killed then its copies should also be killed.
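For example (hypothetical source, just to illustrate the duplication):

```java
// Hypothetical example: javac emits the body of the finally block twice,
// once for the normal path and once for the exceptional path. Both copies
// carry the same line number in the debug information.
class FinallyExample {
    static int counter;

    static void update(Runnable action) {
        try {
            action.run();
        } finally {
            counter++; // compiled twice: normal exit and exception exit
        }
    }
}
```

A mutant in `counter++` therefore exists twice in the bytecode, once per copy, even though both correspond to the same source-level change.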

To work around this, org.pitest.mutationtest.build.intercept.javafeatures.InlinedFinallyBlockFilter looks for identical code, with identical line numbers, that appears in different blocks. When it finds such code, it combines the two mutants into one.

This works with line-number-based coverage, as any test that hits the line will be run against the higher-order mutant. I don't believe it will work with block coverage (but there is no failing test, so I am either wrong or never wrote that test), as the mutant reports itself as being in only one of the two blocks.

I think there would be three ways to fix this.

  1. Change the model so a mutant can describe itself as being in more than one block
  2. Target by instruction instead
  3. Change the way finally block handling works so that the mutants are assessed individually and then filtered once the results are known: just the one with the most favourable result would be reported.

I think the first option would require lots of very unpleasant re-work.

The second option is probably both easier and better. Multiple instructions are already captured in the model (for historic reasons they're called "indexes" in MutationIdentifier).

Then all that's needed is something that maps blocks to instructions.
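Something along these lines would probably do (a rough sketch with hypothetical names, not existing pitest code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch (names hypothetical): walk a method's instructions once,
// recording which instruction indexes fall into which coverage block.
final class BlockToInstructions {
    interface BlockBoundaries {
        boolean endsBlock(int instructionIndex);
    }

    static Map<Integer, Set<Integer>> map(int instructionCount,
                                          BlockBoundaries blocks) {
        Map<Integer, Set<Integer>> result = new HashMap<>();
        int block = 0;
        for (int i = 0; i < instructionCount; i++) {
            // every instruction belongs to the current block
            result.computeIfAbsent(block, b -> new HashSet<>()).add(i);
            if (blocks.endsBlock(i)) {
                block++; // next instruction starts a new block
            }
        }
        return result;
    }
}
```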

The third option is perhaps the cleanest, as it allows the model to be simplified so that a mutant is always at a single instruction. It would, however, require a mechanism to delay reporting the result of one mutant until the results of the others had been received, which could be messy.

… integration test to make sure that mutants in finally blocks are run on the correct tests, and reported only once
@jon-bell (Contributor, Author)

Thanks for the suggestions. I figured that option 3 is interesting, and probably best for further extensions in the future, but I felt much more confident that I'd get it right with the second. I've changed it so that mutants are now targeted by instruction, and I've added a new integration test that demonstrates this (the existing test suite wasn't sensitive enough to catch this bug).

Regarding block granularity: are you OK with making each method invocation end a block by default, and having a flag to also include the other exception-throwing instructions? If so, should that configuration flag also be exported into the mutation/line coverage files (to simplify parsing down the road)?

@hcoles (Owner) commented Nov 18, 2018

I'd prefer to stick to a single, non-configurable definition of a block. Making it configurable adds complexity and new edge cases (e.g. history files become invalid if the definition is changed).

As for the size of that block, we'd need to run some experiments as I said before, but one based on blocks ended by method calls is looking promising. I've just run this branch on a couple of codebases and it shaved a couple of minutes off the overall time while adding only a few seconds to the coverage stage.

These were both "old school" code bases with large methods and some exception based logic flow. I wouldn't expect there to be the same kind of saving with more modern code, but I may well be mistaken.

It may also be possible to shave some of those seconds back at the coverage stage. Pitest has an optimisation that kicks in for methods with 16 blocks or fewer, using local variables as probes. Above 16 blocks, a single array is used to store probes, which makes each probe hit very slightly more expensive.

With the new block definition a large percentage of methods will now have more than 16 blocks, so expanding the optimisation out to some higher figure (32?) could have a measurable effect, although I suspect at some point it would become a negative optimisation.
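Conceptually the two strategies look like this in Java terms (the real instrumentation works at the bytecode level; the stub store and names here are hypothetical):

```java
// Conceptual Java rendering of the two probe strategies.
class ProbeStrategies {
    // Small methods (<= 16 blocks): one local boolean per block, flushed
    // to the store once at method exit. Local variable writes are cheap.
    static void smallMethod() {
        boolean p0 = false, p1 = false;
        p0 = true;         // block 0 executed
        // ... method body ...
        p1 = true;         // block 1 executed
        CoverageStoreStub.report(new boolean[] { p0, p1 });
    }

    // Larger methods: a single boolean[] holds all probes; each hit is an
    // array store, slightly more expensive than a local variable write.
    static void largeMethod() {
        boolean[] probes = new boolean[17];
        probes[0] = true;  // block 0 executed
        // ... method body ...
        probes[16] = true; // block 16 executed
        CoverageStoreStub.report(probes);
    }

    // Stand-in for pitest's coverage store, to keep the sketch compilable.
    static final class CoverageStoreStub {
        static void report(boolean[] probes) { /* record the hits */ }
    }
}
```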

@jon-bell (Contributor, Author)

I'd prefer to stick to a single, non-configurable definition of a block. Making it configurable adds complexity and new edge cases (e.g. history files become invalid if the definition is changed).

Sure - sounds very fair.

I've just run this branch on a couple of codebases and it shaved a couple of minutes off the overall time while adding only a few seconds to the coverage stage.

Awesome!

It may also be possible to shave some of those seconds back at the coverage stage. Pitest has an optimisation that kicks in for methods with 16 blocks or fewer, using local variables as probes. Above 16 blocks, a single array is used to store probes, which makes each probe hit very slightly more expensive.

Speaking of performance, I was curious about this. I haven't done any profiling/benchmarking between the two approaches, but I am very familiar with JaCoCo's probe instrumentation strategy and was looking through the PIT code in comparison.

It looks like right now, each time PIT records coverage in a method, it needs to retrieve the probe array from the CLASS_HITS HashMap. This is a non-trivial overhead and could be replaced with a static boolean[] field per class, initialised when the class is loaded (e.g. by calling CodeCoverageStore.registerClassProbes). This is what JaCoCo does in its ClassFieldProbeArrayStrategy (which I think is used in nearly every case, and I am not sure why the same approach couldn't work here). Then, in each method, rather than having a local boolean[] or a set of boolean local variables, PIT could update the probe array directly.
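In Java terms, an instrumented class would end up looking something like this (a sketch of the idea only; the stub store, its signature, and the field name are hypothetical, and the real change would be made in bytecode):

```java
// Sketch of the proposed strategy: each class gets a static probe array
// initialised once at class load, so methods write straight to it with
// no per-call map lookup.
class InstrumentedClass {
    private static final boolean[] $$pitProbes =
            CoverageStoreStub.registerClassProbes("InstrumentedClass", 2);

    static int doubled(int x) {
        $$pitProbes[0] = true; // probe for block 0: a direct array store
        return x * 2;
    }

    // Stand-in for the coverage store, to keep the sketch compilable.
    static final class CoverageStoreStub {
        static boolean[] registerClassProbes(String className, int probes) {
            return new boolean[probes]; // a real store would also index it
        }
    }
}
```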

I'm not sure exactly how much this change would improve things, but especially in short methods the HashMap lookup has to be a pretty big bottleneck. If you are interested, I am happy to take a look at this as well (although presumably in a separate PR).

@hcoles (Owner) commented Nov 19, 2018

The CLASS_HITS hashmap etc. dates back a long way, so it is there because that's-how-it's-implemented rather than because it is necessarily the best strategy.

That said, one major difference between jacoco and pitest is that pitest only ever collects per-test coverage. Most of the time jacoco only wants to dump coverage when the JVM shuts down, and I'd imagine it has been optimised with that in mind.

If pitest stored hits in class fields, it would need to gather data from every instrumented class in the JVM at the end of each test. I don't know the cost of that as I've never implemented it, but it sounds expensive. Possibly more expensive than the combined cost of all the hits to the hashmap; possibly not.

There could also be some thread safety concerns. The last time I looked at the jacoco codebase I was convinced that it didn't establish a happens-before relationship between writes and reads of the fields, so writes to probes might not be visible when the data is read. Not a problem for a single read on JVM shutdown, but potentially an issue when reads are more frequent.

Initially pitest wrote directly to the map each time a probe was hit, and that was indeed really slow. So at some point this was changed to happen only once, at the end of the method call, using local variables as probes.

Local variable writes are cheaper than field writes, so in some limited circumstances this strategy could actually be faster.

It's a long time since I looked, but when I last compared pitest's coverage to jacoco and cobertura, pitest was able to gather per-test coverage about 20% slower than jacoco could gather one-shot coverage, and about 200% faster than cobertura could gather anything (I understand cobertura has improved a lot since then).

I don't know how it compared to jacoco for per test coverage back then, or how it compares now. I suspect jacoco is still faster, but I'd hope not by too much.

If you'd like to have a play at speeding up the coverage, feel free, but yes, it would need to be a separate PR.

@jon-bell (Contributor, Author)

Here are some statistics from collecting coverage from commons-math on my laptop:

Jacoco (dumping per-test): 70 seconds
Baseline (Pit 1.4.3): 161 seconds
Precise-blocks (assume ONLY invoke* can throw exception): 178 seconds
Precise-blocks (complete exception handling): 242 seconds

Things look a lot better with the optimizations discussed above (#545):
Coverage-optimized (no changes to blocks): 73 seconds
Precise-blocks (assume ONLY invoke* can throw exception): 78 seconds
Precise-blocks (complete exception handling): 94 seconds

@jon-bell (Contributor, Author)

Here's a little more background. Our motivation for these changes was a study in which we tried to track down mutants that were being generated by PIT but not killed. We found that in many cases, rather than the mutants being equivalent or the test suites inadequate, those mutants weren't killed because they were never even covered when run against the test (they were "not viable"). Initially, we thought this was because the tests were flaky (non-deterministic). Upon investigation, we found that many mutants generated by PIT were not killable because PIT thought they were in instructions covered by test cases when in fact they weren't.

I don't have data recorded showing the precise contribution of each additional block-ending instruction, but I can say with very high certainty that our patch is complete and now results in every generated mutant truly being viable.

I think the biggest change here (rather than just making method calls end blocks) is targeting mutants to instructions instead of line numbers. This required much more invasive changes (just changing where blocks end was pretty straightforward), but I think changing the targeting to instructions is even more important for ensuring that the generated mutants are viable (covered by the tests we run them on). It also has the positive effect that, with this PR, instruction indices match up identically between the coverage file and the mutant report file.

@jon-bell (Contributor, Author) commented Jul 8, 2019

@hcoles is there anything that I can do to help get this merged in?

@hcoles merged commit cdcf7a4 into hcoles:master on Aug 27, 2019
@hcoles (Owner) commented Aug 27, 2019

@jon-bell Sorry for the very long delay on this.

I've merged in the change, but with blocks that end on method calls rather than each instruction for the reasons previously discussed. From a small sample of projects this seems to provide roughly the same benefit in terms of performance as the much smaller blocks.

Although I previously said I'd prefer not to make the size configurable, if there's a need for the smaller blocks from a research point of view I'd be open to the smaller block size being made a configuration option.

@jon-bell (Contributor, Author)

Thanks, this is great news and very helpful for us!

I will not push hard at this point to make the size configurable; given that the rest of the changes are merged in, it would not be hard to configure this myself in a fork. However, if you are willing to add it behind a flag, I wouldn't say no :)
