
Introduce cargo kani assess scan for analyzing multiple projects #2029

Merged: 21 commits, Jan 4, 2023

Conversation

tedinski
Contributor

Description of changes:

  1. Introduces `cargo kani assess scan`, to run assess on multiple projects/workspaces at once and present aggregated results across all of them.
  2. Introduces an "assess metadata" file with an unstable format. Currently this is only used by scan to communicate results from multiple assess subprocess invocations (one per package/workspace).
  3. Updates the argument parsing so that subcommands have their own arguments (see the sketch after this list).
  4. I found a bug with `-p` but avoided fixing it in this PR (because `tests/cargo-kani/ws-specified` relied upon it); I added a test that should be fixed later and referenced the relevant issue.
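
As a concrete illustration of item 3, here is a minimal sketch of per-subcommand arguments using clap's derive API. The type names and the `--only-codegen` placement are assumptions for illustration, not Kani's actual argument definitions (the real code may be structured differently or use a different clap version):

```rust
// A hypothetical sketch, not Kani's actual CLI: each subcommand owns its
// own argument struct, so flags no longer need to be passed before it.
use clap::{Args, Parser, Subcommand};

#[derive(Parser)]
#[command(name = "cargo-kani")]
struct CargoKaniCli {
    #[command(subcommand)]
    command: Option<KaniSubcommand>,
}

#[derive(Subcommand)]
enum KaniSubcommand {
    /// `cargo kani assess`, which in turn owns a `scan` subcommand.
    Assess(AssessArgs),
}

#[derive(Args)]
struct AssessArgs {
    /// Example of a flag owned by `assess` instead of the top level (assumed).
    #[arg(long)]
    only_codegen: bool,

    #[command(subcommand)]
    command: Option<AssessSubcommand>,
}

#[derive(Subcommand)]
enum AssessSubcommand {
    /// `cargo kani assess scan`: assess every package below the current directory.
    Scan,
}

fn main() {
    // Parse and dispatch; real code would match on the subcommands here.
    let cli = CargoKaniCli::parse();
    let _ = cli.command;
}
```

With a layout like this, flags such as `--only-codegen` can live on the `assess` subcommand rather than being accepted (and prepended) globally.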

Resolved issues:

Resolves #1922

Call-outs:

First and foremost, I don't plan to merge this PR until I'm able to present a comparison of the results between scan and the existing scripts/exps script for measuring unsupported features in the "top-100 crates" (to be precise, the top 70 repos, containing 219 total crates, which include the top 100). This run takes a long time to complete, however, and I'm awaiting results. I'll post them here when ready.

We do expect differences in those numbers, from:

  1. Assess more accurately identifies projects in the git repo. For instance, if the root is not a cargo project at all and the packages are located in a subdirectory, assess finds them and the script does not. I can attempt to control for this somewhat, however.
  2. The current scripts apparently don't pass --workspace or --all-features (which I had thought they did... and unfortunately I began the run with those enabled, so this will be a difference in output).
  3. Assess is using the MIR linker, not the legacy linker, so we get the stdlib.
  4. Assess builds in test mode, so it includes more code and dependencies.

Despite all this, based on previous trial runs, we actually get relatively similar output in the prevalence of unsupported features across these crates. So I'm not expecting surprises here.

I do want to note some other followup work I intend for later PRs:

  1. Passing options before the subcommand (cargo kani -p package assess) is a hack, and I want to refactor the argument parsing so we can easily put them after the subcommand.
  2. I intend to write some documentation on using assess, to be placed under "dev documentation" for now (while it's unstable).
  3. I want to further automate the "git clone these repos" process, so scan can work just like the existing script. I felt it would be best to start here with the basic function and initial results, however.
  4. This is only the start of this feature, meant to replicate our existing script pretty closely, and we should then be able to quickly add more. For instance, it should be "simple" to run the tests by just not passing --only-codegen, except that performance is so slow this might take forever. I might try it over break. :)
  5. There is currently no test for scan directly. I'm not sure what we want to do here. I'd like to add a nightly workflow for running this on the top-100, but the performance isn't good enough (too long: 7+ hours; too much memory: 12+ GB). I'm kinda thinking about doing it with a cron job hack on my dev machine until we are able to make progress on the performance.

Testing:

  • How is this change tested? Manually; see above.

  • Is this a refactor change?

Checklist

  • Each commit message has a non-empty body, explaining why the change was made
  • Methods or procedures are documented
  • Regression or unit tests are included, or existing tests cover the modified code
  • My PR is restricted to a single feature or bugfix

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.

@tedinski requested a review from a team as a code owner on December 21, 2022 21:08
@tedinski
Contributor Author

Ah, the run finished a bit early, about 5 hours instead of 7 (thanks to --release)

Overall analysis time: 17186.314s
Assessed 165 successfully, with 54 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 caller_location               |      110 |       333 
 simd_bitmask                  |       54 |       230 
 try                           |       43 |      1727 
 Projection mismatch           |       34 |        83 
 TerminatorKind::InlineAsm     |       23 |        41 
 PointerCast::ClosureFnPointer |        9 |        44 
 'sqrtf64' intrinsic           |        3 |         6 
 'expf64' intrinsic            |        2 |         4 
 'logf64' intrinsic            |        2 |         4 
 'powif64' intrinsic           |        2 |         4 
 float_to_int_unchecked        |        1 |        24 
 'expf32' intrinsic            |        1 |         2 
 'logf32' intrinsic            |        1 |         2 
 'powf32' intrinsic            |        1 |         2 
 'powf64' intrinsic            |        1 |         2 
 'sqrtf32' intrinsic           |        1 |         2 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
======================================================

The old script had roughly these results:

--- OVERALL STATS ---
12 crates failed to compile
23 crates had warning(s)

SUMMARY - UNSUPPORTED FEATURES
=========================================================
Unsupported feature | Crates impacted | Instances of use
---------------------------------------------------------
                                simd_bitmask  |  20 |  26
                         Projection mismatch  |  10 |  10
                                         try  |   4 |  13
               PointerCast::ClosureFnPointer  |   3 |   6
                   TerminatorKind::InlineAsm  |   3 |   3
                         'sqrtf64' intrinsic  |   1 |   1
                         'powif64' intrinsic  |   1 |   1
                         'powif32' intrinsic  |   1 |   1
                          'logf64' intrinsic  |   1 |   1
                          'expf64' intrinsic  |   1 |   1
=========================================================

We see lots more intrinsics, more instances, more crates impacted, more crates failing. I suspect a lot of this comes down to more crates being successfully identified (found at all by the tool, every crate in the workspace built, etc). I can try to control for that, but I'm curious if I need to spend the time on that... I think this may be sufficiently explained!

@adpaco-aws
Contributor

adpaco-aws commented Dec 21, 2022

There's lots to process here, so I'll provide a longer review later, but I think we need to spend more time on defining what exactly it is we are measuring here. Some questions:

  • "Crates impacted" goes to 110 for caller_location even though we're analyzing the top 100 crates. What is this number actually?
  • Assessed 165 successfully, with 54 failures. - Same question here, what's the unit for 165?
  • Why is it taking 5 hours to collect these metrics? It's using the MIR linker which in principle is more efficient, right? The old script didn't take more than 15 minutes...

@celinval
Contributor

Questions:

  1. Do we account for the same repository for different crates?
  2. Would it make sense to pull the code from https://crates.io instead of pulling the repository? The repository could have crates that are not necessarily relevant from the "top X" perspective, e.g., some build or test infrastructure (like we have) that should never be visible to users.
  3. Are we compiling integration tests into this mix?

@zhassan-aws
Contributor

FYI, #2032 may have an impact on the number of caller_location unsupported feature warnings.

@celinval
Contributor

FYI, #2032 may have an impact on the number of caller_location unsupported feature warnings.

These crates don't use the Kani library, though. So unfortunately I don't expect any change. :(

@celinval
Contributor

I was also wondering what the use case for this feature is. Should it be part of kani-driver or a utility tool built on top of cargo kani?

@tedinski
Contributor Author

...even though we're analyzing the top 100 crates. What is this number actually?

The data set from that file is 83 git repos. These were the source of the "top-100 crates" we have been analyzing. Assess finds 219 packages in them. The old script finds fewer (it would take some work to find out how many exactly...)

For instance, the old script just runs cargo kani --only-codegen at the root. tokio, for instance, has no Cargo.toml at the root of its repo, so the old script just fails immediately there, but assess scan finds 10 packages in that repo (4 succeed, 6 fail to build).

Why is it taking 5 hours to collect these metrics? It's using the MIR linker which in principle is more efficient, right?

It is not necessarily more efficient. The MIR linker worked wonders for most of our customers because the reachability graphs are typically so small. But something in some of the code we're analyzing seems to be triggering huge reachability graphs for some reason. Combine that with several integration tests under `tests/`, which are built separately, and those huge graphs get regenerated over and over (the downside of not codegen'ing things once and re-using them). I've been compiling notes on this performance problem...

Do we account for the same repository for different crates?

I'm not sure whether this question is asking something else, but I think it was probably answered above?

Would it make sense to pull the code from https://crates.io/ instead of pulling the repository?

Not directly, no. The next step we want here is to do things like run the tests, so I don't think we want that.

But I do think it'd be a good idea to maybe add a --filter-packages option or something, so we could conclusively report on just the crates we're interested in.

Are we compiling integration tests to this mix?

Yep. They're not counted as separate packages; they just increase "instances of use" (and build times...).

I was also wondering what is the use case for this feature. Should it be part of kani-driver or a utility tool built on the top of cargo kani?

Part of the driver. There are already several customers who have multiple Rust projects that are not in a shared workspace. Anyone who wants to run assess on all of them at once will need this tool. Breaking it out to somewhere else is just extra complexity.

@tedinski
Contributor Author

FYI, #2032 may have an impact on the number of caller_location unsupported feature warnings.

What Celina said, though I suspect you may be on to something here regarding at least one reason why the reachability graphs get so huge for some of these crates. I've been wondering if all the stdlib backtrace machinery might be one cause...

@rahulku
Contributor

rahulku commented Dec 22, 2022

Is this an apples-to-apples comparison with the script? It seems like this is a superset. Is it possible to do a side-by-side comparison to ensure we are not looking at amplified or spurious numbers?

@celinval
Contributor

I see @tedinski's point that the old script had some flaws that might not be worth trying to match just for comparison's sake.

Looking forward, I think we should have a clear definition of what it is we want to measure. What kind of information are we trying to collect?

For the top 100 crates, I was expecting up to 100 crates to be reported. The unsupported-features count should then basically read as the percentage of those 100 crates affected by each construct.

@tedinski
Contributor Author

I'm trying to figure out what the simplest approach to doing something more side-by-side would be.

My current plan is to add the filter mechanism to scan, then run it only on the exact set of packages from the old script.

I've been digging into the old script's output more and:

  1. 48 packages "appear successful" and don't contribute anything to the unsupported features list.
  2. 12 fail to be handled at all
  3. 3 appear to crash in symtab or goto-cc (not kani-compiler) AND give warnings. This complicates things slightly, since without "success" assess-scan won't have metadata to examine. Fortunately, this is only 3 crates, so it's easy to see what the impact of missing them would be: (2 crates `simd_bitmask`, 1 crate `PointerCast::ClosureFnPointer`, 1 crate `TerminatorKind::InlineAsm`, 1 crate `Projection mismatch`)
  4. 20 build successfully with warnings

So I think if I filter down to those last 20, I'll have more comparable numbers, with the main difference being that assess-scan still starts from tests, while the old script with the legacy linker was starting from public functions. (And, of course, that with the MIR linker, we have the stdlib around, so numbers might change slightly there too.)

@celinval
Contributor

Are you planning to change it to start from public functions instead?

@tedinski
Contributor Author

Are you planning to change it to start from public functions instead?

Not planning that presently. It's a feature that would only serve to get us closer to an identical comparison between the old and new scripts, and it would stop being useful as soon as we move on to the next step of trying to run all the tests.


Here's the results on the 20 crates I identified from the old script that succeed and contribute unsupported features:

Assessed 20 successfully, with 0 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 caller_location               |       20 |        50 
 simd_bitmask                  |       15 |        42 
 Projection mismatch           |       10 |        18 
 try                           |        7 |        30 
 TerminatorKind::InlineAsm     |        4 |         6 
 PointerCast::ClosureFnPointer |        2 |        20 
 float_to_int_unchecked        |        1 |        24 
 'expf64' intrinsic            |        1 |         1 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'logf64' intrinsic            |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
 'sqrtf64' intrinsic           |        1 |         1 
======================================================
Old script results, for comparison
=========================================================
Unsupported feature | Crates impacted | Instances of use
---------------------------------------------------------
                                simd_bitmask  |  20 |  26
                         Projection mismatch  |  10 |  10
                                         try  |   4 |  13
               PointerCast::ClosureFnPointer  |   3 |   6
                   TerminatorKind::InlineAsm  |   3 |   3
                         'sqrtf64' intrinsic  |   1 |   1
                         'powif64' intrinsic  |   1 |   1
                         'powif32' intrinsic  |   1 |   1
                          'logf64' intrinsic  |   1 |   1
                          'expf64' intrinsic  |   1 |   1
=========================================================

The differences (recall that the old script effectively analyzes 3 more crates: their builds fail, but it still counts their feature warnings):

  • caller_location is certainly due to assess having the stdlib around.
  • Assess sees 5 fewer simd_bitmask. 2 of these are from the 3 crates that fail. 3 are probably due to "test vs pub-fn" reachability.
  • Projection is identical, but because of the 3 crates that fail, this means assess sees 1 fewer. I suspect the same cause: reachability.
  • Assess sees 3 more uses of try. This is probably test code.
  • Assess sees 1 more use of inline asm. A bit curious. Maybe stdlib or test code.
  • ClosureFnPointer is identical, once we add the missing 1 from the 3 crates that fail.
  • After that, we see more single-use intrinsics. I suspect the stdlib is the primary cause here.

@tedinski
Contributor Author

Also, here are updated numbers for the overall data set (now 220 packages) with --all-features removed.

Assessed 176 successfully, with 44 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 caller_location               |      113 |       316 
 simd_bitmask                  |       45 |       183 
 try                           |       41 |       926 
 Projection mismatch           |       30 |        70 
 TerminatorKind::InlineAsm     |       26 |        48 
 PointerCast::ClosureFnPointer |        8 |        37 
 'sqrtf64' intrinsic           |        3 |         6 
 'expf64' intrinsic            |        2 |         4 
 'logf64' intrinsic            |        2 |         4 
 'powif64' intrinsic           |        2 |         4 
 'powf64' intrinsic            |        2 |         3 
 float_to_int_unchecked        |        1 |        24 
 'expf32' intrinsic            |        1 |         2 
 'logf32' intrinsic            |        1 |         2 
 'powf32' intrinsic            |        1 |         2 
 'sqrtf32' intrinsic           |        1 |         2 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
======================================================

@celinval (Contributor) left a comment

Can you add some test coverage? Even if in the form of a bash script.

@adpaco-aws I was wondering if we should create a mode in compiletest to run scripts or something like that. Maybe look for test.yml.

/// The structure of `.kani-assess-metadata.json` files, which are emitted for each crate.
/// This is not a stable interface.
#[derive(Deserialize)]
pub struct AssessMetadata {
Contributor

maybe add the name of the crate and a list of dependencies

Contributor Author

Assess metadata is across multiple crates, so not really applicable here. Something like that might be nice for kani-metadata, or we could think about what we actually want to aggregate and present here somehow...

Contributor

Should we at least add the package name? This could be useful when reporting the successful/failed packages.

Contributor Author

I intend, in a follow-up PR, to add another table here that might help with that.

But again, metadata is across multiple packages, so there isn't one name to put here. I'll add a better doc comment on this type.
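
To make the aggregation idea concrete, here is a rough sketch of how scan could fold per-package `.kani-assess-metadata.json` files into the combined "crates impacted / instances of use" counts shown above. The field names (`unsupported_features`, `feature`, `instances`) are hypothetical stand-ins, not the actual unstable format:

```rust
// Rough sketch only: the real `.kani-assess-metadata.json` format is unstable
// and its field names differ; `unsupported_features`/`feature`/`instances`
// here are placeholders to show the aggregation shape.
use serde::Deserialize;
use std::collections::BTreeMap;
use std::fs;
use std::path::Path;

#[derive(Deserialize)]
struct AssessMetadata {
    unsupported_features: Vec<UnsupportedFeature>,
}

#[derive(Deserialize)]
struct UnsupportedFeature {
    feature: String,
    instances: u32,
}

/// Fold per-package metadata into (crates impacted, instances of use) per feature.
fn aggregate(paths: &[&Path]) -> std::io::Result<BTreeMap<String, (u32, u32)>> {
    let mut totals: BTreeMap<String, (u32, u32)> = BTreeMap::new();
    for path in paths {
        let text = fs::read_to_string(path)?;
        let meta: AssessMetadata = serde_json::from_str(&text)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
        for uf in meta.unsupported_features {
            let entry = totals.entry(uf.feature).or_insert((0, 0));
            entry.0 += 1; // this package is one more "crate impacted"
            entry.1 += uf.instances; // add this package's instance count
        }
    }
    Ok(totals)
}
```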

let mut cmd = Command::new("cargo");
cmd.arg("kani");
// Use of options before 'assess' subcommand is a hack, these should be factored out.
// TODO: --only-codegen should be outright an option to assess. (perhaps tests too?)
Contributor

What about --verbose, --default-unwind, -p, --workspace, --target and so on?

Contributor Author

Yep, all this prepending stuff needs to go (imo even --enable-unstable).
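
For context, this is roughly the invocation pattern being discussed, with global flags prepended before the `assess` subcommand; the exact flag set here is an assumption for illustration, not the precise command scan builds:

```rust
// Illustrative sketch of the current "options before the subcommand" hack;
// the flag list is assumed for illustration, not the exact command scan builds.
use std::path::Path;
use std::process::{Command, ExitStatus};

fn assess_package(package_dir: &Path) -> std::io::Result<ExitStatus> {
    let mut cmd = Command::new("cargo");
    cmd.arg("kani");
    // Hack: global options currently have to come before the subcommand.
    cmd.args(["--enable-unstable", "--only-codegen"]);
    cmd.arg("assess");
    cmd.current_dir(package_dir);
    cmd.status()
}
```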

@tedinski
Contributor Author

Data from the actual top-100:

Using tests as reachability roots:

Assessed 74 successfully, with 26 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 caller_location               |       71 |       239 
 simd_bitmask                  |       39 |       160 
 try                           |       29 |      1352 
 Projection mismatch           |       23 |        53 
 TerminatorKind::InlineAsm     |        9 |        21 
 PointerCast::ClosureFnPointer |        7 |        39 
 'sqrtf64' intrinsic           |        2 |         2 
 float_to_int_unchecked        |        1 |        24 
 'expf64' intrinsic            |        1 |         1 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'logf64' intrinsic            |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
======================================================

A hack to use pub-fns as reachability roots:

Assessed 72 successfully, with 28 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 try                           |       77 |      2667 
 simd_bitmask                  |       77 |       580 
 Projection mismatch           |       77 |       279 
 caller_location               |       77 |       279 
 'sqrtf64' intrinsic           |       77 |       275 
 TerminatorKind::InlineAsm     |        8 |        27 
 PointerCast::ClosureFnPointer |        6 |        38 
 float_to_int_unchecked        |        1 |        24 
 'expf64' intrinsic            |        1 |         1 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'logf64' intrinsic            |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
======================================================

Notes:

  • We do see more instances. I think that's probably expected: with test reachability, there is likely code that is not covered by tests in most crates.
  • The sudden jump in 'sqrtf64' intrinsic is initially surprising, but these crates are being built in test mode still, which means it's codegen'ing all the test runner machinery here. I think each of these unsupported features appears there, and that's why they all report 77 identically.
  • The crates impacted being 77, when only 75 succeeded, is suspect. I dug into this a little, and it looks like there are a few crates that appear to fail to build on their own for strange reasons (the error messages indicate very weirdly configured workspaces: e.g., the workspace with the futures crate has multiple distinct packages named futures with different versions), but they do build successfully when building a "sister package" in the workspace. This causes them to appear under that neighboring package for reasons I'm not clear on yet. I believe this is a problem with the "metadata reconstruction" process in how we deal with cargo_metadata. Certainly it wasn't meant to handle workspaces with multiple packages of the same name. :(

I think the inclusion of the test machinery is a confounder. I'm going to re-do this, but not building in test mode. That means changing two variables in the comparison, however: regular build, not test build, and pub-fn reachability, not test reachability.

@tedinski
Contributor Author

Ok, I fixed the crate count bug I noticed above, and re-ran the analysis with the additional change of not building in test mode. Here are the results.

With test mode on, and using test reachability (default):

Assessed 74 successfully, with 26 failures.
======================================================
 Unsupported feature           |   Crates | Instances 
                               | impacted |    of use 
-------------------------------+----------+-----------
 caller_location               |       68 |       239 
 simd_bitmask                  |       37 |       160 
 try                           |       26 |      1352 
 Projection mismatch           |       23 |        53 
 TerminatorKind::InlineAsm     |        9 |        21 
 PointerCast::ClosureFnPointer |        7 |        39 
 'sqrtf64' intrinsic           |        2 |         2 
 float_to_int_unchecked        |        1 |        24 
 'expf64' intrinsic            |        1 |         1 
 'fmaf32' intrinsic            |        1 |         1 
 'fmaf64' intrinsic            |        1 |         1 
 'log10f32' intrinsic          |        1 |         1 
 'log10f64' intrinsic          |        1 |         1 
 'log2f32' intrinsic           |        1 |         1 
 'log2f64' intrinsic           |        1 |         1 
 'logf64' intrinsic            |        1 |         1 
 'powif32' intrinsic           |        1 |         1 
======================================================

With test mode off, and using pub-fn reachability (hacked in for comparison):

Assessed 60 successfully, with 40 failures.
==================================================
 Unsupported feature       |   Crates | Instances 
                           | impacted |    of use 
---------------------------+----------+-----------
 caller_location           |       28 |        29 
 simd_bitmask              |        9 |        13 
 TerminatorKind::InlineAsm |        3 |         4 
 Projection mismatch       |        3 |         3 
 'expf64' intrinsic        |        1 |         1 
 'logf64' intrinsic        |        1 |         1 
 'sqrtf64' intrinsic       |        1 |         1 
==================================================

The pub-fns mode shows a substantial increase in the number of crates failing. This is because of an existing bug in the pub-fns reachability which I've filed here:

@tedinski
Contributor Author

tedinski commented Dec 28, 2022

I've further tracked down why the old script was detecting more simd_bitmask crates impacted than assess:

The old script incorrectly builds and counts packages. It attempts to build these repos:

repository: https://github.com/RustCrypto/utils
repository: https://github.com/crossbeam-rs/crossbeam
repository: https://github.com/rust-lang/regex
repository: https://github.com/alexcrichton/cc-rs

And it counts them each as "one crate", but a normal build actually builds multiple packages within these workspaces, and so multiple simd_bitmask messages are emitted, and get counted as multiple "crates impacted." In other words, if you ran the script on just cc-rs for instance, it would say it analyzed 1 crate, and found 2 crates impacted by simd_bitmask.

Assess is actually not missing any of these instances; it's just a miscounting bug in the old script.

@tedinski
Contributor Author

Ok, current status:

  • Wrote up some dev-documentation for assess, included in this PR.
  • Investigated simd_bitmask, results are above.
  • Tried pub-fns versus test reachability modes, with and without also disabling test mode. Results seem quite promising for test-based reachability giving better data here, though possibly clouded by a bug with pub-fns mode that I found and opened.
  • Got a new export of top crates, including both repos and crate names now, so we can 100% accurately build the top-100 and get results for exactly those.
  • Added a rudimentary test for assess-scan to the regression. It's just a script that runs and looks for files.

I believe this is ready for review again.

(cd bar && cargo clean))
EXPECTED_FILES=(bar/bar.kani-assess-metadata.json foo/foo.kani-assess-metadata.json bar/bar.kani-assess.log foo/foo.kani-assess.log)
for file in ${EXPECTED_FILES[@]}; do
if [ -f $KANI_DIR/tests/assess-scan-test-scaffold/$file ]; then
Contributor

Should we check the contents of the file to make sure nothing is broken? Perhaps diff them against an expected file? Or grep for certain lines?

Contributor Author

I'm not happy with trying to add more ad-hoc testing steps here. There are at least a couple of checks implicitly happening here:

  1. If scan exits unsuccessfully, it fails.
  2. If something goes wrong with the individual assesses, the metadata files aren't emitted, even if scan somehow misses the failure.
  3. The details of the results shouldn't actually be any different compared to the assess tests that are in the cargo-kani suite. It'd be nice to check that, but it'd kinda be crudely emulating an expect test but with greps?

I think we should do better, but we're stacking up ad-hoc testing scripts that really need a more principled solution...

@tedinski
Contributor Author

I believe I've addressed all PR comments. I also added a quick script to clone the top-100 and run assess on them, so that the process of running this is clear.

@zhassan-aws (Contributor) left a comment

Perhaps in a follow-up PR, but it would be useful if cargo kani assess scan could take a number of directories as arguments, as opposed to needing to have all the packages cloned inside one directory.

@celinval
Contributor

celinval commented Jan 3, 2023

I think it would be great if we provided more visibility into why there are X failures. I.e., cargo assess [scan] should still provide data even if we failed to build the package. Ideally, we should be able to categorize whether there was a compilation error or an ICE. For compilation errors, we should collect which errors were found, and for ICEs we should save at least the location that triggered the panic.

@tedinski
Contributor Author

tedinski commented Jan 3, 2023

I think it would be great if we provided more visibility into why there are X failures. I.e., cargo assess [scan] should still provide data even if we failed to build the package. Ideally, we should be able to categorize whether there was a compilation error or an ICE. For compilation errors, we should collect which errors were found, and for ICEs we should save at least the location that triggered the panic.

Agreed! My plans for next steps here include adding a classifier for build failures (e.g., a few categories of other problems, plus something that merges on the "file:line of the panic in kani-compiler").

Actually, I'll open an issue with that feature request.
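
Not part of this PR, but as a sketch of what such a classifier might look like (the categories and the string matching below are hypothetical, not existing Kani code):

```rust
// Hypothetical sketch of a build-failure classifier (not existing Kani code):
// bucket a failed build's stderr into an ICE, keyed by the panic's file:line,
// versus an ordinary compilation error.
#[derive(Debug, PartialEq)]
enum BuildFailure {
    /// Internal compiler error, keyed by the `file:line:col` of the panic.
    Ice(String),
    /// Ordinary compilation error(s) in the package being built.
    CompileError,
    /// Anything we could not categorize.
    Unknown,
}

fn classify_stderr(stderr: &str) -> BuildFailure {
    if stderr.contains("internal compiler error") || stderr.contains("panicked at") {
        // e.g. "thread 'rustc' panicked at '...', kani-compiler/src/....rs:123:4"
        let location = stderr
            .lines()
            .find(|line| line.contains("panicked at"))
            .and_then(|line| line.rsplit(' ').next())
            .unwrap_or("unknown location")
            .to_string();
        return BuildFailure::Ice(location);
    }
    if stderr.lines().any(|l| l.starts_with("error[") || l.starts_with("error:")) {
        return BuildFailure::CompileError;
    }
    BuildFailure::Unknown
}
```

Bucketing ICEs by the panic's file:line would let scan report that N packages failed due to the same kani-compiler panic, rather than just a bare failure count.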

@celinval (Contributor) left a comment

I'll take a look at #2047.

Thanks for collecting the data without test mode. If it's not too hard, I would like to be able to collect data without building things in test mode. I'm OK if for now that requires --only-codegen. The test machinery significantly influences the results, but I'm not sure it will impact proof harnesses.

I would also prefer that we don't limit ourselves to things that are covered by tests.

That said, I don't think this is a blocker for this PR, and it can be done as follow-up work.

Assess will normally build just like `cargo kani` or `cargo build`, whereas `scan` will find all cargo packages beneath the current directory, even in unrelated workspaces.
Thus, 'scan' may be helpful in the case where the user has a choice of packages and is looking for the easiest to get started with (in addition to the Kani developer use-case, of aggregating statistics across many packages).

(Tip: Assess may need to run for a while, so try using `screen`, `tmux` or `nohup` to avoid terminating the process if, for example, an ssh connection breaks.)
Contributor

Should we also mention running with a memory limit?

Contributor Author

I haven't yet tried memory limits, but (when not using --only-codegen and running tests) I do set an explicitly lower -j value so the machine doesn't run out of memory. I can at least mention both options.


Unimplemented features are not necessarily actually hit by (dynamically) reachable code, so an immediate future improvement on this table would be to count the features *actually hit* by failing test cases, instead of just those features reported as existing in code by the compiler.
In other words, the current unsupported features table is **not** what we'd really want to see, in order to actually prioritize implementing these features, because we may be seeing a lot of features that won't actually "move the needle" in making it more practical to write proofs.
Because of our operating hypothesis that code covered by tests is code that could be covered by proof, measuring unsupported features by those actually hit by a test should provide a better "signal" about priorities.
Contributor

I don't know if I agree here. Conceptually, finding a failure that no existing unit test finds is exploring uncovered code. Also, wouldn't this already be reported in the test failure table?

Contributor Author

I don't know if I agree here. Conceptually, finding a failure that no existing unit test finds is exploring uncovered code.

Right, but we aren't finding failures that no existing unit test covers?

Also, wouldn't this already be reported in the test failure table?

The test failure table is largely expected to see tests fail because of deficiencies in Kani, not something else. (We could conceivably find more failures, like MIRI does, but we don't implement a lot of Rust-specific undefined behavior yet, so this is less likely.)

Contributor

I mean the runtime check. Wouldn't that be the reason why the test failed? Or is this reported as an assertion failure today?

Contributor Author

For an unsupported feature or missing function that we hit, yeah, it comes back as a failed property. (Not necessarily with the 'assertion' property class.)

Unimplemented features are not necessarily actually hit by (dynamically) reachable code, so an immediate future improvement on this table would be to count the features *actually hit* by failing test cases, instead of just those features reported as existing in code by the compiler.
In other words, the current unsupported features table is **not** what we'd really want to see, in order to actually prioritize implementing these features, because we may be seeing a lot of features that won't actually "move the needle" in making it more practical to write proofs.
Because of our operating hypothesis that code covered by tests is code that could be covered by proof, measuring unsupported features by those actually hit by a test should provide a better "signal" about priorities.
Implicitly deprioritizing unsupported features because they aren't covered by tests may not be a bug, but a feature: we may simply not want to prove anything about that code, if it hasn't been tested first, and so adding support for that feature may not be important.
Contributor

Yeah, I'm not convinced that's the case. I think test is a great starting point, but I don't think we should deprioritize things that aren't covered by tests.

We've talked before about automatic generation of harnesses that wouldn't rely on tests, e.g., a function where all of its inputs implement the Arbitrary trait.

Contributor Author

Yeah, I'm not convinced that's the case. I think test is a great starting point, but I don't think we should deprioritize things that aren't covered by tests.

It is how assess works, and I thought I wasn't saying "this is how things ought to be" but "this is the operating hypothesis for assess" (see for instance that I explicitly wrote "may not be a bug, but a feature")

I guess I can clarify that this is the operating hypothesis for assess, not a team consensus. We want to look and see.

@tedinski merged commit 48569a2 into model-checking:main on Jan 4, 2023
@tedinski deleted the assess-scan branch on January 4, 2023 20:43