Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

Closed
wants to merge 1 commit into from

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Feb 8, 2020

Problem

Finally, I found the root cause of #8130 ...:

(Moved from #8148's description)

Snapshot's actual content can become divert from the corresponding bank hash. Currently, purge_zero_lamport_account simply causes too-eager storage removals when zero lamport account updates are chained across 3+ slots (For exact situation, see a test).

--- /good-snapshot-pubkeys  2020-02-06 18:57:12.813230728 +0000
+++ /bad-snapshot-pubkeys  2020-02-06 18:57:12.813230728 +0000
@@ -179,6 +179,7 @@
 key: 4vgoKb76Z2vj9V9z7hoQpZkkwJrkL1z35LWNd9EXSi2o
 key: 4x5cmzjq7MBZG4KZjRQ2hepsWG1X9PvnJs1C2Q68PQmQ
 key: 4xARLQv1SoiVNbmXmVYyQzEjG3YA67X1LfH9KkTLtzKd
+key: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn
 key: 4zAe2W9b3GmQKG2ZbinimDB7VgaNvfNoQDKdNKDiHuBU
 key: 4zoHsMTdhuDAkHAHNwEZqNqEmFyyqRxUxHwGuSdeepyB
 key: 52EW9njXGZzFduTLCZWfQNjPb8QMXgLZRWxEFCN3xaAx

=> This diff shows the bank hash mismatch is caused by the presence of 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn

$ less good3.log | grep 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn 
[2020-02-06T06:14:58.602035261Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 0, pubkey: (StoredMeta { write_version: 176, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 0, offset: 18168, lamports: 500000890880 })
[2020-02-06T06:15:02.010395039Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 112211, pubkey: (StoredMeta { write_version: 8079864, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 432426, offset: 0, lamports: 499973176160 })
[2020-02-06T06:15:03.103655633Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9075320, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 484959, offset: 408, lamports: 499973176147 })
[2020-02-06T06:15:03.411181688Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 136455, pubkey: (StoredMeta { write_version: 9867711, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 526476, offset: 22152, lamports: 973176143 })
[2020-02-06T06:15:03.439180800Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 136546, pubkey: (StoredMeta { write_version: 9874694, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 526846, offset: 22152, lamports: 0 })
$ less bad3.log | grep 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn 
[2020-02-06T06:10:31.668645696Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 0, pubkey: (StoredMeta { write_version: 176, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 2, offset: 8144, lamports: 500000890880 })
[2020-02-06T06:10:34.653084054Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 112211, pubkey: (StoredMeta { write_version: 8076459, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 427648, offset: 29168, lamports: 499973176160 })
[2020-02-06T06:10:35.794195861Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9071781, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 479476, offset: 136, lamports: 499973176147 })
[2020-02-06T06:10:39.134235318Z DEBUG solana_runtime::accounts_db] xoring.. account: Account { lamports: 499973176147 data.len: 0 owner: 11111111111111111111111111111111 executable: false rent_epoch: 12 hash: 8Mvx4iidAhxWNAhuUU2vFt5RLGdXdJdaRCFM1By7nPPM } slot: 125761, BankHash 6d5e6ce8b914d18c0e775feac5f20c08b60b0940b12d0867d8c4ebe8cebd4310 key: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn

=> This shows bad snapshot doesn't contain some updates of 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, which should have occurred at slot 136455 and 136546. To be specific, 136546 removed the account itself. However, that update doesn't exist in the bad snapshot. That's why that account appears in the bank hash calculation.

$ grep -E '125761|136455|136546' bad-files good-files 
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479477
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479479
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479475
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479476
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479478
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526849
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526480
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484960
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526848
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526479
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526846
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484959
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526478
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484958
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526851
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526850
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484962
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526476
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526477
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484961
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526847

=> This shows bad snapshot actually doesn't contain any storage entries for such slots first of all.

New conclusive findings

We can confirm this is really what happened on the bad-behaving TdS validator:

Firstly, relevant slots for 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn account updates are 136546, 136455 and 125761. 136455 is innocent slot which contains no zero lamport accounts and can be reclaimed normally. 136546 is the slot which contains the should-be-kept zero lamport account update (See above). 125761 is the slot which contained the reappeared old and bad account state. So we can say 136546 is chaining to 125761 via a zero lamport account update.

And there is another zero lamport account update in 125761:

[2020-02-06T06:15:03.103895475Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9075321, pubkey: Ag51oCkwGy5rkbGEYrcP9GDjvFGMJrEdLxvedLSTSR15, data_len: 0 }, AccountInfo { store_id: 484959, offset: 544, lamports: 0 })

So, 125761 is definitely chaining to an older storage entry. This satisfies this bug's condition.

Remaining question is why this isn't deterministic and not cluster-wide. However, #8148 can partly explain this because that causes some not-deterministic behavior in validators.

Summary of Changes

Only add a test showing the bad behavior because I need some confirmation for the further direction.

Fixes #8130

@ryoqun ryoqun added the work in progress This isn't quite right yet label Feb 8, 2020

assert_load_account(&accounts, current_slot, pubkey, some_lamport);
assert_load_account(&accounts, current_slot, purged_pubkey1, 0);
assert_load_account(&accounts, current_slot, purged_pubkey2, 0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion fails as follows:

---- accounts_db::tests::test_accounts_purge_chained stdout ----
thread 'accounts_db::tests::test_accounts_purge_chained' panicked at 'assertion failed: `(left == right)`
  left: `(100225, 2)`,
 right: `(0, 2)`', runtime/src/accounts_db.rs:1928:9
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
  ..: ...
             ...
  13: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:334
  14: solana_runtime::accounts_db::tests::assert_load_account
             at runtime/src/accounts_db.rs:1928
  15: solana_runtime::accounts_db::tests::test_accounts_purge_chained
             at runtime/src/accounts_db.rs:2176

This should pass, while purged_pubkey2's lamport being 0.

accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]);
accounts.add_root(current_slot);

purge_zero_lamport_accounts(&accounts, current_slot);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting out this purge_zero_lamport_accounts() invocartion makes this test pass.

accounts.add_root(current_slot);

current_slot += 1;
accounts.store(current_slot, &[(&purged_pubkey2, &zero_lamport_account)]);
Copy link
Member Author

@ryoqun ryoqun Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem of current purge_zero_lamport_account is that it cannot recognize these chained/dependent/cascading relationship of zero lamport accounts.

In this database settings, no storage should be removed. But currently that function incorrectly removes a storage entry for this line's store. So, the old version (=not yet purged) of purged_pubkey2 re-appears incorrectly, causing this assertion failure.

Copy link
Member Author

@ryoqun ryoqun Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an algorithm expert at all; however the cost of these DAG dependency aware purging logic should be costly. It think that logic at least require individual checks and/or recursive checks. And it basically will be light-weight stop-the-world GC. Also, this could be DOS-friendly.

Because I couldn't come up with a simple and encouraging logic, this PR doesn't come with actual implementation...

I think we have two paths:

  1. Implement that logic rigorously.
  2. Separate non-zero account stores and zero account stores into different ones (To cut indefinitely-long chains).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sakridge I think the path 2. sounds more reasonable. Could you share your thoughts on this bug? Also, is there any better solution in your mind? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For #2, it isn't clear to me how that exactly simplifies the problem, because don't you still have to track whether these accounts are shielding others even if they are in different stores or not?

accounts.add_root(current_slot);

current_slot += 1;
accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed because of another bug

@ryoqun
Copy link
Member Author

ryoqun commented Feb 12, 2020

Superseded by #8218

@ryoqun ryoqun closed this Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
work in progress This isn't quite right yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'Snapshot bank failed to verify', ledger/src/snapshot_utils.rs:462:9
2 participants