Doppelganger protection can fail if Nimbus is set up in an always-restart-on-failure scenario #3687
Comments
One way to make this work would be to rely on documented exit codes from the beacon node, so that a doppelganger exit can be picked up as "fatal" by restart scripts. It should also be documented in the manual. cc @unixpi
@arnetheduck this is a good idea. We can use the option documented at
https://www.freedesktop.org/software/systemd/man/systemd.service.html#SuccessExitStatus= We could use whatever exit code is used when doppelganger detection triggers a node exit.
https://www.freedesktop.org/software/systemd/man/systemd.exec.html# table 8/9 lists "standard" error codes for many of these situations.
To be fair, most exit codes listed there are for specific technical issues, like failing to set CPU affinity or adjusting OOM, but we should at least not use those for something they are not.
I think the "standard exit codes" tables are useful here only in the sense that it would be good to avoid them. Our custom error code "Failing due to a detected doppelganger" could probably use a value above … Two actionable items here are implementing this custom error code and expanding the Nimbus guide with a table for our exit codes.
This has been addressed by the PR linked above. If Nimbus detects a doppelganger on the network, it will exit with the error code 1031. Users are advised to check for this error code before restarting Nimbus automatically in a supervisor. For example, in systemd, this can be achieved by the following option in the service unit file:
The new functionality will be shipped in Nimbus 22.6.0.
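For supervisors other than systemd, the same check can be done in a small wrapper. The following is a minimal sketch, not Nimbus tooling: `supervise`, `fatal_codes` and `max_restarts` are hypothetical names, and the exit code to treat as fatal is whatever the supervisor actually observes. On POSIX systems a parent process only sees the low 8 bits of a child's exit status, so the value to compare against should be verified on the target platform.

```python
import subprocess


def supervise(cmd, fatal_codes, max_restarts=5):
    """Run cmd, restarting it on failure unless it exits with a fatal code.

    cmd          -- argument list for the child process (hypothetical example)
    fatal_codes  -- set of exit codes that must never trigger a restart,
                    e.g. the code used for a detected doppelganger
    max_restarts -- upper bound on restarts, so the sketch always terminates

    Returns the last exit code observed.
    """
    code = 0
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        if code == 0 or code in fatal_codes:
            # Clean shutdown or fatal condition: stop and leave it to the operator.
            break
        # Any other failure: fall through and start the process again.
    return code
```

A real wrapper would also log the fatal exit and alert the operator, since a doppelganger exit means the same keys may be active elsewhere.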
Otherwise we'd end up in a restart loop that could cause two or more nodes to synchronize and start at the same time, rendering doppelganger detection ineffective. For more details see: status-im/nimbus-eth2#3687 Signed-off-by: Jakub Sokołowski <jakub@status.im>
I was trying to find an equivalent for OSX Launchd.
I don't see a way to specify the exit code, so that's a shame. But then again you'd have to be crazy to run OSX on a server...
No such option exists in |
Describe the bug
From https://github.com/status-im/infra-docs/blob/master/docs/postmortems/20220526_prater_validator_slashing.md
We had an issue where two servers had the same set of keys. We didn't notice it, so server A spent most of its time restarting because the DG protection was stopping it.
That continued until server B was also restarted. Then, because both servers were in the DG protection phase, they didn't detect each other, and got slashed.
I don't see a straightforward way to fix this in the code, but maybe we should advise putting exponential backoff in place instead of restarting instantly.
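As an illustration of the exponential backoff idea, here is a minimal sketch of how a restart script could compute its waiting time. The function name `backoff_delay` and the `base` and `cap` defaults are assumptions chosen for the example, not values recommended by Nimbus.

```python
def backoff_delay(attempt, base=1.0, cap=300.0):
    """Return the seconds to wait before restart number `attempt` (0-based).

    The delay doubles with every consecutive failure and is limited to `cap`.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    return min(cap, base * (2 ** exponent))
```

A restart loop would call `time.sleep(backoff_delay(n))` before the n-th restart and reset `n` once the node has stayed up for a while. This only stretches the window between restarts; it does not by itself guarantee that two nodes with the same keys never start together.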