Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential deadlock in state_key::Entry::drop #14670

Merged
merged 1 commit into from
Sep 20, 2024

Conversation

msmouse
Copy link
Contributor

@msmouse msmouse commented Sep 18, 2024

Description

If before Entry::drop() calls Registry::maybe_remove, the weak ref under the same key1 and key2 is replaced, Entry::drop can be called recursively resulting in a deadlock. The crux is we try to see if the entry has been replaced by trying to upgrading it to a strong ref, and it possible that the upgrade is successful and the result becomes a only ref to the new entry.

This tries to fix it by determining the same thing by seeing if Weak::strong_count() is 0 instead of upgrading the weak ref.

Type of Change

  • Bug fix

Which Components or Systems Does This Change Impact?

  • Validator Node

How Has This Been Tested?

eyes of hackers

Copy link

trunk-io bot commented Sep 18, 2024

⏱️ 7h 34m total CI duration on this PR
Job Cumulative Duration Recent Runs
execution-performance / single-node-performance 1h 51m 🟥🟥🟩🟩🟥 (+1 more)
forge-compat-test / forge 1h 5m 🟩🟩🟩
forge-e2e-test / forge 48m 🟩🟩🟩
forge-framework-upgrade-test / forge 47m 🟩🟥🟩
execution-performance / test-target-determinator 22m 🟩🟩🟩🟩🟩
test-target-determinator 18m 🟩🟩🟩🟩
check 15m 🟩🟩🟩🟩
general-lints 12m 🟩🟩🟩🟩🟩 (+2 more)
rust-cargo-deny 12m 🟩🟩🟩🟩🟩 (+2 more)
rust-move-tests 10m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
check-dynamic-deps 7m 🟩🟩🟩🟩🟩 (+3 more)
run-tests-main-branch 7m 🟥
rust-doc-tests 5m 🟩
rust-doc-tests 5m 🟩
rust-doc-tests 5m 🟩
rust-doc-tests 5m 🟩
rust-images / rust-all 4m 🟥
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+3 more)
file_change_determinator 2m 🟩🟩🟩🟩 (+2 more)
rust-move-tests 1m
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+2 more)
file_change_determinator 50s 🟩🟩🟩🟩
Backport PR 44s 🟥🟥🟩
permission-check 31s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 30s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 18s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 15s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 14s 🟩🟩🟩🟩
determine-docker-build-metadata 11s 🟩🟩🟩🟩
permission-check 6s 🟩🟩
rust-doc-tests 1s
rust-move-tests 1s

🚨 4 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
execution-performance / single-node-performance 24m 18m +35%
forge-e2e-test / forge 18m 14m +23%
forge-compat-test / forge 22m 18m +22%
forge-framework-upgrade-test / forge 20m 16m +21%

settingsfeedbackdocs ⋅ learn more about trunk.io

Copy link
Contributor

@ziaptos ziaptos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp

@@ -152,7 +152,7 @@ where
let mut locked = self.inner.write();
if let Some(map2) = locked.get_mut(key1) {
if let Some(entry) = map2.get(key2) {
if entry.upgrade().is_none() {
if entry.strong_count() == 0 {
map2.remove(key2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not move the other conditional inside this if statement, while already here :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I tried to think of why I wrote it that way other than one less level of nesting but I can't 🧌

@msmouse msmouse enabled auto-merge (squash) September 18, 2024 18:05

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

If before Entry::drop() calls Registry::maybe_remove, the weak ref under
the same key1 and key2 is replaced, Entry::drop can be called
recursively resulting in a deadlock. The crux is we try to see if the
entry has been replaced by trying to upgrading it to a strong ref, and
it possible that the upgrade is successful and the result becomes a only
ref to the new entry.

This tries to fix it by determining the same thing by seeing if
Weak::strong_count() is 0 instead of upgrading the weak ref.
@msmouse msmouse enabled auto-merge (squash) September 20, 2024 18:12

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on a67cf583b76da5eab692f0e7ff8b06fda9333ed1

two traffics test: inner traffic : committed: 14094.85 txn/s, latency: 2819.06 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5359300
two traffics test : committed: 100.09 txn/s, latency: 1661.44 ms, (p50: 1600 ms, p70: 1600, p90: 1700 ms, p99: 7500 ms), latency samples: 1700
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.264, avg: 0.234", "QsPosToProposal: max: 1.095, avg: 1.056", "ConsensusProposalToOrdered: max: 0.321, avg: 0.297", "ConsensusOrderedToCommit: max: 0.451, avg: 0.422", "ConsensusProposalToCommit: max: 0.749, avg: 0.719"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.76s no progress at version 21925 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 9.35s no progress at version 3921364 (avg 4.95s) [limit 15].
Test Ok

Copy link
Contributor

✅ Forge suite framework_upgrade success on 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1

Compatibility test results for 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1 (PR)
Upgrade the nodes to version: a67cf583b76da5eab692f0e7ff8b06fda9333ed1
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1013.94 txn/s, submitted: 1015.76 txn/s, failed submission: 1.82 txn/s, expired: 1.82 txn/s, latency: 3283.94 ms, (p50: 2100 ms, p70: 3300, p90: 7400 ms, p99: 8500 ms), latency samples: 89300
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 789.18 txn/s, submitted: 792.02 txn/s, failed submission: 2.85 txn/s, expired: 2.85 txn/s, latency: 3972.33 ms, (p50: 2100 ms, p70: 5700, p90: 9000 ms, p99: 10900 ms), latency samples: 72020
5. check swarm health
Compatibility test for 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1 passed
Upgrade the remaining nodes to version: a67cf583b76da5eab692f0e7ff8b06fda9333ed1
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1162.87 txn/s, submitted: 1163.94 txn/s, failed submission: 1.02 txn/s, expired: 1.07 txn/s, latency: 3225.41 ms, (p50: 2100 ms, p70: 3300, p90: 7200 ms, p99: 8700 ms), latency samples: 91176
Test Ok

Copy link
Contributor

✅ Forge suite compat success on 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1

Compatibility test results for 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1 (PR)
1. Check liveness of validators at old version: 25a081116546670e62ca927ba90478de78557056
compatibility::simple-validator-upgrade::liveness-check : committed: 14443.02 txn/s, latency: 2358.48 ms, (p50: 2100 ms, p70: 2200, p90: 2400 ms, p99: 6600 ms), latency samples: 465280
2. Upgrading first Validator to new version: a67cf583b76da5eab692f0e7ff8b06fda9333ed1
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7239.52 txn/s, latency: 3864.30 ms, (p50: 4100 ms, p70: 4200, p90: 5100 ms, p99: 5200 ms), latency samples: 147160
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7743.77 txn/s, latency: 4112.46 ms, (p50: 4100 ms, p70: 4300, p90: 6500 ms, p99: 6700 ms), latency samples: 257780
3. Upgrading rest of first batch to new version: a67cf583b76da5eab692f0e7ff8b06fda9333ed1
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7416.48 txn/s, latency: 3641.47 ms, (p50: 3800 ms, p70: 4600, p90: 5000 ms, p99: 5100 ms), latency samples: 132240
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7567.96 txn/s, latency: 4179.20 ms, (p50: 4200 ms, p70: 4400, p90: 6300 ms, p99: 6700 ms), latency samples: 252320
4. upgrading second batch to new version: a67cf583b76da5eab692f0e7ff8b06fda9333ed1
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11239.36 txn/s, latency: 2360.84 ms, (p50: 2400 ms, p70: 2800, p90: 3200 ms, p99: 3400 ms), latency samples: 202420
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9959.71 txn/s, latency: 2720.83 ms, (p50: 2500 ms, p70: 2900, p90: 3200 ms, p99: 5700 ms), latency samples: 389100
5. check swarm health
Compatibility test for 25a081116546670e62ca927ba90478de78557056 ==> a67cf583b76da5eab692f0e7ff8b06fda9333ed1 passed
Test Ok

@msmouse msmouse merged commit 8eb8b52 into main Sep 20, 2024
92 checks passed
@msmouse msmouse deleted the 0918-alden-fix-deadlock-2 branch September 20, 2024 18:49
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
aptos-release-v1.20

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants