Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Closed
Menduist opened this issue May 31, 2022 · 8 comments
Assignees
Milestone

Comments

@Menduist
Copy link
Contributor

Describe the bug
From https://github.com/status-im/infra-docs/blob/master/docs/postmortems/20220526_prater_validator_slashing.md
We had an issue where two servers had the same set of key. We didn't notice it, so server A spent most of his time restarting because the DG protection was stopping it.
That continued until server B was also restarted. Then, because both servers were in the DG protection phase, they didn't detect each other, and got slashed.

Don't see a straightforward way to fix this in the code, but maybe we should advise to put exponential backoff in place, instead of restarting instantly

@arnetheduck
Copy link
Member

Don't see a straightforward way to fix this in the code, but maybe we should advise to put exponential backoff in place, instead of restarting instantly

one way to make this work would be to rely on documented exit codes from the beacon node, such that it can be picked up as "fatal" by restart scripts

it should also be documented in the manual cc @unixpi

@jakubgs
Copy link
Member

jakubgs commented Jun 1, 2022

@arnetheduck this is a good idea, we can use the SuccessExitStatus setting in the Systemd service definition:

SuccessExitStatus=

Takes a list of exit status definitions that, when returned by the main service process, will be considered successful termination, in addition to the normal successful exit status 0 and, except for Type=oneshot, the signals SIGHUP, SIGINT, SIGTERM, and SIGPIPE. Exit status definitions can be numeric termination statuses, termination status names, or termination signal names, separated by spaces. See the Process Exit Codes section in systemd.exec(5) for a list of termination status names (for this setting only the part without the "EXIT_" or "EX_" prefix should be used). See signal(7) for a list of signal names.

Note that this setting does not change the mapping between numeric exit statuses and their names, i.e. regardless how this setting is used 0 will still be mapped to "SUCCESS" (and thus typically shown as "0/SUCCESS" in tool outputs) and 1 to "FAILURE" (and thus typically shown as "1/FAILURE"), and so on. It only controls what happens as effect of these exit statuses, and how it propagates to the state of the service as a whole.

This option may appear more than once, in which case the list of successful exit statuses is merged. If the empty string is assigned to this option, the list is reset, all prior assignments of this option will have no effect.

Example 1. A service with the SuccessExitStatus= setting

SuccessExitStatus=TEMPFAIL 250 SIGKILL

Exit status 75 (TEMPFAIL), 250, and the termination signal SIGKILL are considered clean service terminations.

https://www.freedesktop.org/software/systemd/man/systemd.service.html#SuccessExitStatus=

We could use whatever the code is used when doppelganger detection triggers a node exit.

@arnetheduck
Copy link
Member

https://www.freedesktop.org/software/systemd/man/systemd.exec.html# table 8/9 lists "standard" error codes for many of these situations

@jakubgs
Copy link
Member

jakubgs commented Jun 1, 2022

To be fair most exit codes listed there are for specific technical issues, like failing to set CPU affinity or adjusting OOM, but we should at least not use those for something they are not.

@zah zah added this to the v22.6.0 milestone Jun 9, 2022
@zah
Copy link
Contributor

zah commented Jun 9, 2022

I think the "standard exit codes" tables are useful here only in the sense that it would be good to avoid them. Our custom error code "Failing due to a detected doppelganger" could probably use a value above 1000 to be on the safe side.

To actionable items here are implementing this custom error code and expanding the Nimbus guide with a table for our exit codes.

@zah
Copy link
Contributor

zah commented Jun 19, 2022

This has been addressed by the PR linked above. If Nimbus detects a doppelganger on the network, it will exit with the error code 1031. Users are advised to check for this error code before restarting Nimbus automatically in a supervisor. For example, in systemd, this can be achieved by the following option in the service unit file:

# Don't restart when Doppelganger detection has been activated
RestartPreventExitStatus=1031 

The new functionality will be shipped in Nimbus 22.6.0

@zah zah closed this as completed Jun 19, 2022
jakubgs added a commit to status-im/infra-role-beacon-node-linux that referenced this issue Jun 20, 2022
Otherwise we'd end up in a restart loop that could cause two or more
nodes to synchronize and start at the same time, rendering doppelganger
detection ineffective. For more details see:
status-im/nimbus-eth2#3687

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs
Copy link
Member

jakubgs commented Jun 20, 2022

I was trying to find an equivalent for OSX Launchd plist files, but the closes I found was:

SuccessfulExit <boolean>

If true, the job will be restarted as long as the program exits and with an exit status of zero. If false, the job will be restarted in the inverse condition. This key implies that "RunAtLoad" is set to true, since the job needs to run at least once before we can get an exit status.

I don't see a way to specify the exit code, so that's a shame. But then again you'd have to be crazy to run OSX on a server...

@jakubgs
Copy link
Member

jakubgs commented Jun 20, 2022

No such option exists in winsw config either:
https://github.com/winsw/winsw/blob/v2.11.0/doc/yamlConfigFile.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants