Replay stage panicked on mainnet-beta API node #11232

Closed
t-nelson opened this issue Jul 28, 2020 · 5 comments · Fixed by #11235

@t-nelson
Contributor

Problem

thread 'solana-replay-stage' panicked at 'called `Option::unwrap()` on a `None` value', runtime/src/accounts_db.rs:857:29
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86

  ...

  15: solana_runtime::accounts_db::AccountsDB::clean_accounts
  16: solana_ledger::snapshot_utils::add_snapshot
  17: solana_ledger::bank_forks::BankForks::generate_snapshot
  18: solana_ledger::bank_forks::BankForks::set_root
  19: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

log: https://drive.google.com/file/d/1zkxKc6QDoKm_myODIC03BS-Y6q20_cM6/view?usp=sharing

This node was running the v1.1.22 release

Proposed Solution

Triage and resolve

tag: @ryoqun @sakridge

@ryoqun
Member

ryoqun commented Jul 28, 2020

@sakridge Hmm, it looks like the shrink code is very suspicious indeed...

@ryoqun
Member

ryoqun commented Jul 28, 2020

Embarrassingly, ever since slot shrinking was introduced (#9219), it has been racy with clean_accounts()...

I'm now writing a stress test to expose this.
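
To illustrate the general class of failure (not the actual AccountsDB code), here is a minimal sketch of a stress test in which one thread repeatedly removes and re-inserts an entry while another thread looks it up. The `Storage` type, the `SLOT` constant, and the function names are all hypothetical stand-ins. A lookup that calls `unwrap()` assumes the entry is always present and would panic on an unlucky interleaving; the checked variant treats a missing entry as a normal outcome.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a shared slot -> data map; NOT the real AccountsDB layout.
type Storage = Arc<Mutex<HashMap<u64, Vec<u8>>>>;

// Hypothetical slot number used by the stress loop.
const SLOT: u64 = 42;

/// Fragile lookup: assumes the slot is still present and panics with
/// "called `Option::unwrap()` on a `None` value" if another thread removed it.
fn lookup_unwrap(storage: &Storage, slot: u64) -> usize {
    storage.lock().unwrap().get(&slot).unwrap().len()
}

/// Tolerant lookup: a missing slot is reported as `None` instead of panicking.
fn lookup_checked(storage: &Storage, slot: u64) -> Option<usize> {
    storage.lock().unwrap().get(&slot).map(|data| data.len())
}

/// Runs a "shrink-like" thread that briefly removes the slot while the
/// calling thread performs "clean-like" lookups. Returns (hits, misses).
fn stress(iterations: usize) -> (usize, usize) {
    let storage: Storage = Arc::new(Mutex::new(HashMap::new()));
    storage.lock().unwrap().insert(SLOT, vec![0u8; 8]);

    let shrinker = {
        let storage = Arc::clone(&storage);
        thread::spawn(move || {
            for _ in 0..iterations {
                // The entry is absent between the remove and the re-insert.
                let removed = storage.lock().unwrap().remove(&SLOT);
                thread::yield_now();
                if let Some(data) = removed {
                    storage.lock().unwrap().insert(SLOT, data);
                }
            }
        })
    };

    let mut hits = 0;
    let mut misses = 0;
    for _ in 0..iterations {
        match lookup_checked(&storage, SLOT) {
            Some(_) => hits += 1,
            None => misses += 1, // lookup_unwrap would have panicked here
        }
    }

    shrinker.join().unwrap();
    (hits, misses)
}
```

Calling `stress(1_000)` returns a pair whose sum is always 1,000; how many of those are misses depends on thread scheduling, which is consistent with a panic that only shows up very infrequently in production.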

@ryoqun
Member

ryoqun commented Jul 28, 2020

> I'm now writing a stress test to expose this.

Status update: I wrote one and fixed the code; it is pending review. (FYI @mvines: I'm marking this as a semi-blocker for the v1.2 upgrade because I think we're better off fixing this security bug consistently across testnet and mainnet before jumping branches.)

@ryoqun
Member

ryoqun commented Jul 28, 2020

I've confirmed this can occur on mainnet-beta and testnet very infrequently.

Screenshot from 2020-07-29 00-24-27
Screenshot from 2020-07-29 00-24-57

@ryoqun
Member

ryoqun commented Sep 9, 2020

Useful queries for chronograf

SELECT time, host_id, program, thread, message FROM "mainnet-beta"."autogen"."panic" ORDER BY time DESC
SELECT time, host_id, program, thread, message FROM "tds"."autogen"."panic" ORDER BY time DESC

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 5, 2022