Introduce `cargo kani assess scan` for analyzing multiple projects #2029

Conversation
Ah, the run finished a bit early, about 5 hours instead of 7 (thanks to `--release`).
The old script had roughly these results:
We see lots more intrinsics, more instances, more crates impacted, and more crates failing. I suspect a lot of this comes down to more crates being successfully identified (found at all by the tool, every crate in the workspace built, etc.). I can try to control for that, but I'm curious whether I need to spend the time on that... I think this may be sufficiently explained!
There's lots to process here, so I'll provide a longer review later, but I think we need to spend more time on defining what exactly it is we are measuring here. Some questions:
FYI, #2032 may have an impact on some of these numbers.
These crates don't use the Kani library, though, so unfortunately I don't expect any change. :(
I was also wondering what the use case for this feature is. Should it be part of kani-driver, or a utility tool built on top of it?
The data set from that file is 83 git repos. These were the source of the "top-100 crates" we have been analyzing. Assess finds 219 packages in them. The old script finds fewer (it would take some work to find out how many exactly...) For instance, the old script just runs
It is not necessarily more efficient. The MIR linker worked wonders for most of our customers because the reachability graphs are typically so small. But something in some of the code we're analyzing seems to be triggering huge reachability graphs for some reason. Combine that with having several integration tests under
I don't understand if this question is asking something else, but I think it was probably answered above?
Not directly, no. The next step we want here is to do things like run the tests, so I don't think we want that. But I do think it'd be a good idea to maybe add a
Yep. Not being counted as separate packages, just increasing "instances of use" (and build times...)
Part of the driver. There are already several customers who have multiple Rust projects that are not in a shared workspace. Anyone who wants to run assess on all of them at once will need this tool. Breaking it out to somewhere else is just extra complexity.
What Celina said, though I suspect you may be on to something here about at least one reason why the reachability graphs get so huge for some of these crates. I've been wondering if all the stdlib backtrace machinery might be one cause...
Is this an apples-to-apples comparison with the old script? It seems like this is a superset. Is it possible to do a side-by-side comparison to ensure we are not looking at amplified or spurious numbers?
I see @tedinski's point that the old script had some flaws that might not be worth trying to match just for comparison's sake. Looking forward, I think we should have a clear definition of what it is we want to measure. What kind of information are we trying to collect? For the top 100 crates, I was expecting up to 100 crates to be reported. The unsupported features count should then basically read as the percentage of these 100 crates affected by each construct.
I'm trying to figure out what the simplest approach to doing something more side-by-side would be. My current plan is to add the filter mechanism to scan, then run it only on the exact set of packages from the old script. I've been digging into the old script's output more, and:
So I think if I filter down to those last 20, I'll have more comparable numbers, with the main difference being that assess-scan still starts from tests, while the old script with the legacy linker started from public functions. (And, of course, with the MIR linker we have the stdlib around, so numbers might change slightly there too.)
Are you planning to change it to start from public functions instead? |
Not planning that presently. It's a feature that would only serve to get us closer to an identical comparison between the old and new scripts, and it wouldn't be useful again once we move on to the next step of trying to run all the tests. Here are the results for the 20 crates I identified from the old script that succeed and contribute unsupported features:
The differences (recall that the old script analyzes 3 more crates whose builds fail, but it still counts their feature warnings):
Also, here are updated numbers for the overall (now 220) packages in the data set:
Can you add some test coverage? Even if in the form of a bash script.
@adpaco-aws I was wondering if we should create a mode in compiletest to run scripts or something like that. Maybe look for test.yml.
```rust
/// The structure of `.kani-assess-metadata.json` files, which are emitted for each crate.
/// This is not a stable interface.
#[derive(Deserialize)]
pub struct AssessMetadata {
```
maybe add the name of the crate and a list of dependencies
Assess metadata is across multiple crates, so not really applicable here. Something like that might be nice for kani-metadata, or we could think about what we actually want to aggregate and present here somehow...
Should we at least add the package name? This could be useful when reporting the successful/failed packages.
I intend, in a follow-up PR, to add another table here that might help with that.
But again, metadata is across multiple packages, so there isn't one name to put here. I'll add a better doc comment on this type.
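(For illustration only, here is a minimal sketch of what an aggregate-oriented shape and doc comment could look like. This is not the actual Kani type: the `unsupported_features` field and the `UnsupportedFeature` record are assumptions made up for this example.)

```rust
use serde::Deserialize;

/// Sketch only: metadata aggregated across every package that was built,
/// which is why there is deliberately no single package-name field here.
/// Field names are illustrative assumptions, not Kani's real schema.
#[derive(Deserialize)]
pub struct AssessMetadata {
    /// Unsupported-feature counts, summed across all analyzed packages.
    pub unsupported_features: Vec<UnsupportedFeature>,
}

/// Hypothetical record of one unsupported construct and how often it appeared.
#[derive(Deserialize)]
pub struct UnsupportedFeature {
    pub feature: String,
    pub instances: usize,
}
```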
```rust
let mut cmd = Command::new("cargo");
cmd.arg("kani");
// Use of options before 'assess' subcommand is a hack, these should be factored out.
// TODO: --only-codegen should be outright an option to assess. (perhaps tests too?)
```
What about `--verbose`, `--default-unwind`, `-p`, `--workspace`, `--target`, and so on?
Yep, all this prepending stuff needs to go (imo even `--enable-unstable`).
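For reference, here is a minimal sketch of how scan might drive one per-package assess run with the current prepended-flags hack. The helper name is hypothetical, and the exact flag set is an assumption taken from this thread, not the real kani-driver code:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: invoke assess for a single package directory.
/// Flags are prepended before the `assess` subcommand, mirroring the hack
/// discussed above; the real kani-driver invocation may differ.
fn run_assess_on_package(package_dir: &Path) -> std::io::Result<bool> {
    let mut cmd = Command::new("cargo");
    cmd.arg("kani")
        .arg("--enable-unstable") // assumed: assess is still behind the unstable gate
        .arg("--only-codegen")    // assumed: build/codegen only, skip verification
        .arg("assess")
        .current_dir(package_dir); // run inside the package being scanned
    Ok(cmd.status()?.success())
}
```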
Data from the actual top-100.

Using tests as reachability roots:
A hack to use pub-fns as reachability roots:
Notes:
I think the inclusion of the test machinery is a confounder. I'm going to re-do this, but not building in test mode. That means changing two variables in the comparison, however: regular build, not test build, and pub-fn reachability, not test reachability.
OK, I fixed the crate count bug I noticed above, and re-ran the analysis with the additional change of not building in test mode. Here are the results.

With test mode on, and using test reachability (default):
With test mode off, and using pub-fn reachability (hacked in for comparison):
The pub-fns mode shows a substantial increase in the number of crates failing. This is because of an existing bug in the pub-fns reachability, which I've filed here:
I've further tracked down why the old script was detecting more. The old script incorrectly builds and counts packages. It attempts to build these repos:
It counts each of them as "one crate", but a normal build actually builds multiple packages within these workspaces, so multiple instances get counted. Assess is actually not missing any of these instances; it's just a miscounting bug in the old script.
Ok, current status:
I believe this is ready for review again.
scripts/kani-regression.sh
Outdated
```bash
(cd bar && cargo clean))
EXPECTED_FILES=(bar/bar.kani-assess-metadata.json foo/foo.kani-assess-metadata.json bar/bar.kani-assess.log foo/foo.kani-assess.log)
for file in ${EXPECTED_FILES[@]}; do
  if [ -f $KANI_DIR/tests/assess-scan-test-scaffold/$file ]; then
```
Should we check the contents of the file to make sure nothing is broken? Perhaps diff them against an expected file? Or grep for certain lines?
I'm not happy with trying to add more ad-hoc testing steps here. There are at least a couple of checks implicitly happening here:
- If scan exits unsuccessfully, it fails.
- If something goes wrong with the individual assesses, the metadata files aren't emitted, even if scan somehow misses the failure.
- The details of the results shouldn't actually be any different compared to the assess tests that are in the `cargo-kani` suite. It'd be nice to check that, but it'd kind of be crudely emulating an expect test with greps?
I think we should do better, but we're stacking up ad-hoc testing scripts that really need a more principled solution...
Co-authored-by: Zyad Hassan <88045115+zhassan-aws@users.noreply.github.com>
I believe I've addressed all PR comments. I also added a quick script to clone the top-100 and run assess on them, so that the process of running this is clear.
Perhaps in a follow-up PR, but it would be useful if `cargo kani assess scan` could take a number of directories as arguments, as opposed to needing to have all the packages cloned inside one directory.
I think it would be great if we provided more visibility into why there are X failures. I.e.:
Agreed! My plans for next steps here include adding a classifier for build failures (e.g. a few categories of other problems, plus something that merges on "file:line of the panic in kani-compiler"). Actually, I'll open an issue with that feature request.
I'll take a look at #2047.
Thanks for collecting the data without test mode. If it's not too hard, I would like to be able to collect data without building things in test mode. I'm OK if, for now, that requires `--only-codegen`. The test machinery significantly influences the results, but I'm not sure it will impact proof harnesses.
I also would prefer if we don't limit ourselves to things that are covered by tests.
That said, I don't think this is a blocker for this PR, and it can be done as follow-up work.
docs/src/dev-assess.md
Outdated
Assess will normally build just like `cargo kani` or `cargo build`, whereas `scan` will find all cargo packages beneath the current directory, even in unrelated workspaces.
Thus, 'scan' may be helpful in the case where the user has a choice of packages and is looking for the easiest one to get started with (in addition to the Kani developer use-case of aggregating statistics across many packages).

(Tip: Assess may need to run for a while, so try using `screen`, `tmux` or `nohup` to avoid terminating the process if, for example, an ssh connection breaks.)
Should we also mention running with a memory limit?
I haven't yet tried memory limits, but I do (when not using `--only-codegen` and running tests) set an explicitly lower `-j` value so the machine doesn't run out of memory. I can at least mention both options.
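As an aside on the `scan` behavior quoted from the docs above ("find all cargo packages beneath the current directory"), a minimal sketch of that kind of discovery step might look like the following. This is an illustration under assumptions (a plain `Cargo.toml` directory walk), not the actual scan implementation, which may instead consult cargo metadata per workspace:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch: recursively collect directories containing a `Cargo.toml`,
/// even across unrelated workspaces, skipping build output directories.
fn find_cargo_packages(root: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    if root.join("Cargo.toml").is_file() {
        found.push(root.to_path_buf());
    }
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        // Descend into subdirectories, but skip `target` to avoid build artifacts.
        if path.is_dir() && path.file_name() != Some("target".as_ref()) {
            find_cargo_packages(&path, found)?;
        }
    }
    Ok(())
}
```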
docs/src/dev-assess.md
Outdated
Unimplemented features are not necessarily actually hit by (dynamically) reachable code, so an immediate future improvement on this table would be to count the features *actually hit* by failing test cases, instead of just those features reported as existing in code by the compiler.
In other words, the current unsupported features table is **not** what we'd really want to see in order to actually prioritize implementing these features, because we may be seeing a lot of features that won't actually "move the needle" in making it more practical to write proofs.
Because of our operating hypothesis that code covered by tests is code that could be covered by proof, measuring unsupported features by those actually hit by a test should provide a better "signal" about priorities.
I don't know if I agree here. Conceptually, finding a failure that no existing unit test finds is exploring uncovered code. Also, wouldn't this already be reported in the test failure table?
> I don't know if I agree here. Conceptually, finding a failure that no existing unit test finds is exploring uncovered code.

Right, but we aren't finding failures that no existing unit test covers?

> Also, wouldn't this already be reported in the test failure table?

The test failure table is largely expected to show tests failing because of deficiencies in Kani, not something else. (We could conceivably find more failures, like MIRI does, but we don't implement a lot of Rust-specific undefined behavior checks yet, so this is less likely.)
I mean the runtime check. Wouldn't that be the reason why the test failed? Or is this reported as an assertion failure today?
For an unsupported feature or missing function that we hit, yeah, it comes back as a failed property. (Not necessarily with the 'assertion' property class.)
Implicitly deprioritizing unsupported features because they aren't covered by tests may not be a bug, but a feature: we may simply not want to prove anything about that code, if it hasn't been tested first, and so adding support for that feature may not be important. |
Yeah, I'm not convinced that's the case. I think test is a great starting point, but I don't think we should deprioritize things that aren't covered by tests.
We've talked before about automatic generation of harnesses that wouldn't rely on tests, e.g. for functions whose inputs all implement `Arbitrary`.
> Yeah, I'm not convinced that's the case. I think test is a great starting point, but I don't think we should deprioritize things that aren't covered by tests.

It is how assess works, and I thought I wasn't saying "this is how things ought to be" but rather "this is the operating hypothesis for assess" (see, for instance, that I explicitly wrote "may not be a bug, but a feature").

I guess I can clarify that this is the operating hypothesis for assess, not a team consensus. We want to look and see.
Description of changes:

- Introduces `cargo kani assess scan`, to run assess on multiple projects/workspaces at once and present aggregated results across all of them.
- There is a bug involving `-p`, but I avoided fixing it in this PR (because `tests/cargo-kani/ws-specified` relied upon it); I added a test that should be fixed and referenced the relevant issue.

Resolved issues:

Resolves #1922

Call-outs:

First and foremost, I don't plan to merge this PR until I'm able to present a comparison of the results between `scan` and the existing `scripts/exps` script for measuring unsupported features in the "top-100 crates" (e.g. the top-70 repos, containing 219 total crates, which include the top-100, to be precise). This run takes a long time to complete, however, and I'm awaiting results. I'll post them here when ready.

We do expect differences in those numbers, from:

- `--workspace` and `--all-features` are not used (which I had thought they were... and unfortunately I began the run with that enabled, so this will be a difference in output).

Despite all this, based on previous trial runs, we actually get relatively similar output in the prevalence of unsupported features in these crates. So I'm not expecting surprises here.

I do want to note some other followup work I intend for later PRs:

- Prepending options before the subcommand (`cargo kani -p package assess`) is a hack, and I want to refactor the argument parsing so we can easily put the options after the subcommand.
- Running without `--only-codegen`, except that performance is so slow this might take forever. I might try it over break. :)

Testing:
How is this change tested? Manually, see above.
Is this a refactor change?
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.