Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Rust v1.63.0 #25489

Closed
wants to merge 5 commits into from
Closed

Upgrade to Rust v1.63.0 #25489

wants to merge 5 commits into from

Conversation

mvines
Copy link
Member

@mvines mvines commented May 23, 2022

No description provided.

@mvines
Copy link
Member Author

mvines commented May 23, 2022

Two "coverage" job attempts on different machines failed at:

bank::tests::test_remove_unrooted_before_scan
bank::tests::test_remove_unrooted_scan_interleaved_with_remove_unrooted_slots
bank::tests::test_remove_unrooted_scan_then_recreate_same_slot_before_scan

Mashing retry again to see what the 3rd attempt does...

@stale
Copy link

stale bot commented Jun 12, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 12, 2022
@mvines mvines force-pushed the r161 branch 2 times, most recently from dac26ac to 56eacdf Compare June 21, 2022 19:20
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 21, 2022
@mvines
Copy link
Member Author

mvines commented Jun 22, 2022

Same coverage failures as before, but #26115 resolves most of the latest nightly clippy issues to avoid bloating the scope of this issue

@mvines mvines force-pushed the r161 branch 2 times, most recently from f5b75a8 to c4e3de0 Compare June 22, 2022 16:28
@mvines mvines changed the title Upgrade to Rust v1.61.0 Upgrade to Rust v1.62.0 Jun 30, 2022
@mvines mvines mentioned this pull request Jul 1, 2022
@stale
Copy link

stale bot commented Jul 10, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 10, 2022
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 27, 2022
@mvines mvines force-pushed the r161 branch 2 times, most recently from 5440a11 to b1af88d Compare August 4, 2022 15:31
@bw-solana
Copy link
Contributor

bw-solana commented Aug 12, 2022

Looks like we are timing out for all 3 of these test failures in test_store_scan_consistency while scanning banks . We will loop waiting for ~200 seconds for 5 bank scans to complete.

I'm guessing this is a soft timeout as opposed to some sort of deadlock because the tests are passing on my laptop. I'll try increasing the wait time to confirm before diving into what might have slowed down after moving to 1.62

@brooksprumo
Copy link
Contributor

Heh... should we bump all the way to rust v1.63.0 now instead?

https://blog.rust-lang.org/2022/08/11/Rust-1.63.0.html

@bw-solana
Copy link
Contributor

Looks like we are timing out for all 3 of these test failures in test_store_scan_consistency while scanning banks . We will loop waiting for ~200 seconds for 5 bank scans to complete.

I'm guessing this is a soft timeout as opposed to some sort of deadlock because the tests are passing on my laptop. I'll try increasing the wait time to confirm before diving into what might have slowed down after moving to 1.62

Increased timeout 10x and still seeing failures.. need to keep digging

@mvines mvines changed the title Upgrade to Rust v1.62.0 Upgrade to Rust v1.63.0 Aug 12, 2022
@mvines
Copy link
Member Author

mvines commented Aug 12, 2022

I just added Rust 1.63.0 here

@joncinque
Copy link
Contributor

Interestingly, 1.62 was hanging on some tests at recv_mmsg. When I tried to dig in, the Linux-specific implementation using the recv_mmsg syscall didn't make a difference. With the upgrade to 1.63, those issues went away.

@bw-solana
Copy link
Contributor

New set of failures for the following tests:

blockstore_processor::tests::test_process_blockstore_with_dead_slot
blockstore_processor::tests::test_process_blockstore_with_invalid_slot_tick_count
blockstore_processor::tests::test_process_blockstore_with_slot_with_trailing_entry
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_with_blockstore_root
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_without_blockstore_root
blockstore_processor::tests::test_process_ledger_options_full_leader_cache

Looks like they're all failing the assert check at the end of flush due to bucket age advancing unexpectedly.

@brooksprumo
Copy link
Contributor

New set of failures for the following tests:

blockstore_processor::tests::test_process_blockstore_with_dead_slot
blockstore_processor::tests::test_process_blockstore_with_invalid_slot_tick_count
blockstore_processor::tests::test_process_blockstore_with_slot_with_trailing_entry
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_with_blockstore_root
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_without_blockstore_root
blockstore_processor::tests::test_process_ledger_options_full_leader_cache

Looks like they're all failing the assert check at the end of flush due to bucket age advancing unexpectedly.

@jeffwashington Looks like there may be some lingering races still in the bucket aging? Not sure if this is something new with rust v1.63, or just exasperated by.

@bw-solana
Copy link
Contributor

bw-solana commented Aug 12, 2022

New set of failures for the following tests:

blockstore_processor::tests::test_process_blockstore_with_dead_slot
blockstore_processor::tests::test_process_blockstore_with_invalid_slot_tick_count
blockstore_processor::tests::test_process_blockstore_with_slot_with_trailing_entry
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_with_blockstore_root
blockstore_processor::tests::test_process_blockstore_with_supermajority_root_without_blockstore_root
blockstore_processor::tests::test_process_ledger_options_full_leader_cache

Looks like they're all failing the assert check at the end of flush due to bucket age advancing unexpectedly.

@jeffwashington Looks like there may be some lingering races still in the bucket aging? Not sure if this is something new with rust v1.63, or just exasperated by.

Yeah, I can repro this easily on my laptop if I put the non-advancing thread to sleep before this assert check.

It looks like we are somehow releasing the flush lock while we are still inside the flush function. This allows another thread to grab the lock, complete the flush routine, increment buckets flushed, and allow thread 0 to increment age. WIP to understand how/why the flush lock is getting released

@bw-solana
Copy link
Contributor

bw-solana commented Aug 12, 2022

Don't fully understand what's going on yet, but this function is the problem:

impl<'a> FlushGuard<'a> {
    /// Set the `flushing` atomic flag to true.  If the flag was already true, then return `None`
    /// (so as to not clear the flag erroneously).  Otherwise return `Some(FlushGuard)`.
    #[must_use = "if unused, the `flushing` flag will immediately clear"]
    fn lock(flushing: &'a AtomicBool) -> Option<Self> {
        let already_flushing = flushing.swap(true, Ordering::AcqRel);
        (!already_flushing).then_some(Self { flushing })
    }
}

If it is changed to the following, the lock actually seems to work

impl<'a> FlushGuard<'a> {
    /// Set the `flushing` atomic flag to true.  If the flag was already true, then return `None`
    /// (so as to not clear the flag erroneously).  Otherwise return `Some(FlushGuard)`.
    #[must_use = "if unused, the `flushing` flag will immediately clear"]
    fn lock(flushing: &'a AtomicBool) -> Option<Self> {
        let already_flushing = flushing.swap(true, Ordering::AcqRel);
        if !already_flushing {
            Some(Self { flushing })
        } else {
            None
        }
    }
}

@brooksprumo
Copy link
Contributor

Does the version in master work?

(!already_flushing).then(|| Self { flushing })

I'm guessing this was a clippy lint that caused the change in this PR.

@bw-solana
Copy link
Contributor

Does the version in master work?

(!already_flushing).then(|| Self { flushing })

I'm guessing this was a clippy lint that caused the change in this PR.

Yep, master version seems to work.

This problem can be simplified down to this basic unit test:
cargo test --package solana-runtime --lib -- in_mem_accounts_index::tests::test_flush_guard --exact --nocapture

@brooksprumo
Copy link
Contributor

brooksprumo commented Aug 12, 2022

Ah, duh

        (!already_flushing).then_some(Self { flushing })

Will always create the Self, regardless if already_flushing is true or not. So that if .then_some() sees a false, it returns None and causes the just-created Self to drop, which then clears the flushing flag. Heh...

There's a few work-arounds.

  1. Keep then() (probably needs a clippy allow). But gotta be careful when this thing is constructed
  2. Move the Option inside of the FlushGuard, which also requires other work-arounds to make lock() behave as expected
  3. There's probably some way to indicate struct FlushGuard as having a special drop so that the then_some() case is not suggested. Something about drop having side effects maybe?

I'd do option 1, and probably add a comment in lock() saying why not to use then_some() here.

@bw-solana
Copy link
Contributor

@brooksprumo - Do you mean keep the version in master?:
(!already_flushing).then(|| Self { flushing })

@brooksprumo
Copy link
Contributor

@brennanwatt Yep! Sorry for my typo above, fixed it!

@bw-solana
Copy link
Contributor

Able to get a green CI (over at #27095) after the lock changes discussed above

@bw-solana bw-solana mentioned this pull request Aug 15, 2022
@bw-solana
Copy link
Contributor

Created #27148 based on this PR + fixes discussed above

@Lichtso
Copy link
Contributor

Lichtso commented Aug 18, 2022

Closing this one as #27148 has been merged.

@Lichtso Lichtso closed this Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants