Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of memory with Kani 0.46.0 on an s2n-quic harness #3030

Closed
zhassan-aws opened this issue Feb 15, 2024 · 14 comments
Closed

Out of memory with Kani 0.46.0 on an s2n-quic harness #3030

zhassan-aws opened this issue Feb 15, 2024 · 14 comments
Assignees
Labels
[C] Feature / Enhancement A new feature request or enhancement to an existing feature. [E] Performance Track performance improvement (Time / Memory / CPU) T-High Priority Tag issues that have high priority T-User Tag user issues / requests

Comments

@zhassan-aws
Copy link
Contributor

Kani v0.46.0 runs out of memory on the inet::checksum::tests::differential harness in s2n-quic.

This is an example run that runs out of memory:

https://github.com/aws/s2n-quic/actions/runs/7892365794/job/21538690964

This is an example passing run with Kani 0.45.0:

https://github.com/aws/s2n-quic/actions/runs/7837153634/job/21386277828

@zhassan-aws zhassan-aws added T-High Priority Tag issues that have high priority [E] Performance Track performance improvement (Time / Memory / CPU) [C] Feature / Enhancement A new feature request or enhancement to an existing feature. T-User Tag user issues / requests labels Feb 15, 2024
@zhassan-aws
Copy link
Contributor Author

I traced this back to this upstream change:

rust-lang/rust#118578

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

The upstream change effectively added an extra layer of Option<(&[T], &[T])> that split_at needs to unwrap (which it does through a match). This extra layer seems to result in a significant increase in the symex run time/unwinding. The CBMC stats with Kani 0.46.0 are:

Runtime Symex: 231.01s
size of program expression: 4084816 steps
slicing removed 3624753 assignments
Generated 94180 VCC(s), 36701 remaining after simplification
Runtime Postprocess Equation: 9.21824s
Passing problem to propositional reduction
converting SSA
Runtime Convert SSA: 23.0261s

whereas for a passing run with Kani 0.45.0, the stats are:

Runtime Symex: 11.9535s
size of program expression: 348664 steps
slicing removed 274660 assignments
Generated 20558 VCC(s), 6312 remaining after simplification
Runtime Postprocess Equation: 0.413516s
Passing problem to propositional reduction
converting SSA
Runtime Convert SSA: 1.28765s
Running propositional reduction
Post-processing
Runtime Post-process: 0.145403s

@zhassan-aws
Copy link
Contributor Author

@tautschnig can you take a look?

@zhassan-aws
Copy link
Contributor Author

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

Actually, just replacing the split_at call in write_sized_generic with split_at_unchecked is sufficient to make verification succeed. This is the complete diff:

diff --git a/quic/s2n-quic-core/src/inet/checksum.rs b/quic/s2n-quic-core/src/inet/checksum.rs
index 551c88b9..9b0843ae 100644
--- a/quic/s2n-quic-core/src/inet/checksum.rs
+++ b/quic/s2n-quic-core/src/inet/checksum.rs
@@ -58,7 +58,7 @@ fn write_sized_generic<'a, const MAX_LEN: usize, const CHUNK_LEN: usize>(
     //# }
 
     while bytes.len() >= MAX_LEN {
-        let (chunks, remaining) = bytes.split_at(MAX_LEN);
+        let (chunks, remaining) = unsafe { bytes.split_at_unchecked(MAX_LEN) };
 
         bytes = remaining;
 
@@ -404,12 +404,12 @@ mod tests {
     // Reduce the length to 4 for Kani until
     // https://github.com/model-checking/kani/issues/3030 is fixed
     #[cfg(any(kani, miri))]
-    const LEN: usize = if cfg!(kani) { 4 } else { 32 };
+    const LEN: usize = if cfg!(kani) { 16 } else { 32 };
 
     /// * Compares the implementation to a port of the C code defined in the RFC
     /// * Ensures partial writes are correctly handled, even if they're not at a 16 bit boundary
     #[test]
-    #[cfg_attr(kani, kani::proof, kani::unwind(9), kani::solver(cadical))]
+    #[cfg_attr(kani, kani::proof, kani::unwind(17), kani::solver(cadical))]
     fn differential() {
         #[cfg(any(kani, miri))]
         type Bytes = crate::testing::InlineVec<u8, LEN>;
diff --git a/quic/s2n-quic-core/src/lib.rs b/quic/s2n-quic-core/src/lib.rs
index 4ec0e3cf..65971201 100644
--- a/quic/s2n-quic-core/src/lib.rs
+++ b/quic/s2n-quic-core/src/lib.rs
@@ -3,6 +3,7 @@
 
 #![cfg_attr(not(any(test, feature = "std")), no_std)]
 
+#![feature(slice_split_at_unchecked)]
 #[cfg(feature = "alloc")]
 extern crate alloc;

@camshaft
Copy link
Contributor

camshaft commented Feb 16, 2024

What's weird is the line above the split_at checks to make sure that it's in bounds.

while bytes.len() >= MAX_LEN {

I guess I don't know how much optimizations we do in MIR or GOTO but I would expect the split_at to be equivalent to split_at_unchecked with that check. Even with --opt-level=1, all of the panics are removed with the LLVM backend: https://godbolt.org/z/rjcqn67Kc.

@tautschnig
Copy link
Member

It seems the main problem is indeed the bounds check performed by split_at, which in turn results in a number of further operations that are conditional on that. What we seemingly would need, and this is in line with what @camshaft said, is an interval abstract domain running alongside the analysis. We want to do this for loops anyway, so we might as well extend the use to also cover cases like this one.

@mina86
Copy link

mina86 commented Feb 20, 2024

FYI, I’ve no idea what Kani is or what it does, but if you’re happy with nightly features, the proper way to do loop seen in the diff is:

    while let Some((chunks, remaining)) = bytes.split_at(MAX_LEN) {

Or if MAX_LEN is a constant you can use split_first_chunk:

    while let Some((chunks, remaining)) = bytes.split_first_chunk::<{ MAX_LEN }>() {

@celinval
Copy link
Contributor

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

Actually, just replacing the split_at call in write_sized_generic with split_at_unchecked is sufficient to make verification succeed. This is the complete diff:

diff --git a/quic/s2n-quic-core/src/inet/checksum.rs b/quic/s2n-quic-core/src/inet/checksum.rs
index 551c88b9..9b0843ae 100644
--- a/quic/s2n-quic-core/src/inet/checksum.rs
+++ b/quic/s2n-quic-core/src/inet/checksum.rs
@@ -58,7 +58,7 @@ fn write_sized_generic<'a, const MAX_LEN: usize, const CHUNK_LEN: usize>(
     //# }
 
     while bytes.len() >= MAX_LEN {
-        let (chunks, remaining) = bytes.split_at(MAX_LEN);
+        let (chunks, remaining) = unsafe { bytes.split_at_unchecked(MAX_LEN) };
 
         bytes = remaining;
 
@@ -404,12 +404,12 @@ mod tests {
     // Reduce the length to 4 for Kani until
     // https://github.com/model-checking/kani/issues/3030 is fixed
     #[cfg(any(kani, miri))]
-    const LEN: usize = if cfg!(kani) { 4 } else { 32 };
+    const LEN: usize = if cfg!(kani) { 16 } else { 32 };
 
     /// * Compares the implementation to a port of the C code defined in the RFC
     /// * Ensures partial writes are correctly handled, even if they're not at a 16 bit boundary
     #[test]
-    #[cfg_attr(kani, kani::proof, kani::unwind(9), kani::solver(cadical))]
+    #[cfg_attr(kani, kani::proof, kani::unwind(17), kani::solver(cadical))]
     fn differential() {
         #[cfg(any(kani, miri))]
         type Bytes = crate::testing::InlineVec<u8, LEN>;
diff --git a/quic/s2n-quic-core/src/lib.rs b/quic/s2n-quic-core/src/lib.rs
index 4ec0e3cf..65971201 100644
--- a/quic/s2n-quic-core/src/lib.rs
+++ b/quic/s2n-quic-core/src/lib.rs
@@ -3,6 +3,7 @@
 
 #![cfg_attr(not(any(test, feature = "std")), no_std)]
 
+#![feature(slice_split_at_unchecked)]
 #[cfg(feature = "alloc")]
 extern crate alloc;

Can we stub this for now? Also, why wasn't this caught in our regression?

@camshaft
Copy link
Contributor

Can we stub this for now?

We worked around the issue in aws/s2n-quic#2128. However, I wanted to capture this issue and see if we could return to the original length.

why wasn't this caught in our regression?

From my understanding, there are only a handful of s2n-quic harnesses running in the Kani CI. This harness was added ~8 months ago and isn't captured in that list. Even if the Kani CI was running all of the harnesses in s2n-quic, the submodule was quite old and didn't have this harness.

@zhassan-aws
Copy link
Contributor Author

Can we stub this for now?

That should be doable.

Also, why wasn't this caught in our regression?

We hadn't updated the s2n-quic submodule for some time, so this harness wasn't included in our regressions. We've now updated it.

@tautschnig
Copy link
Member

This might be happening again:

* https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23991319945?pr=2184

* https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23993046067?pr=2184#step:3:36574

This wouldn't be too surprising to me: in #3117 (comment) I logged that we appeared to be very narrowly staying within the limits of available memory for this test.

@zhassan-aws
Copy link
Contributor Author

This issue is resolved with Kani 0.54.0: memory usage for the original harness (using a length of 16) does not exceed 1.5 GB:

$ /usr/bin/time --verbose cargo kani --harness inet::checksum::tests::differential
...
SUMMARY:
 ** 0 of 887 failed (6 unreachable)

 ** 1 of 1 cover properties satisfied


VERIFICATION:- SUCCESSFUL
Verification Time: 123.26682s

Complete - 1 successfully verified harnesses, 0 failures, 1 total.
        Command being timed: "cargo kani --harness inet::checksum::tests::differential"
        User time (seconds): 273.60
        System time (seconds): 13.51
        Percent of CPU this job got: 115%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:07.62
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1466228
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 26960
        Minor (reclaiming a frame) page faults: 2999267
        Voluntary context switches: 81426
        Involuntary context switches: 3385
        Swaps: 0
        File system inputs: 3528
        File system outputs: 1548984
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Closing this issue.

@tautschnig
Copy link
Member

Looking at the logs of our past action runs it seems that #3269 made all the difference.

@zhassan-aws
Copy link
Contributor Author

I tracked it down to this change: aws/s2n-quic#2233 which introduced the usage of get_unchecked, and thus removed the extra Option layer that was causing the memory blowup in symex.

Without this change, the memory blowup occurs even with the latest version of Kani (0.55.0). So, unfortunately, it's not a CBMC/Kani improvement that fixed this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[C] Feature / Enhancement A new feature request or enhancement to an existing feature. [E] Performance Track performance improvement (Time / Memory / CPU) T-High Priority Tag issues that have high priority T-User Tag user issues / requests
Projects
None yet
Development

No branches or pull requests

6 participants