Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Menduist · 2022-05-31T14:17:02Z

Describe the bug
From https://github.com/status-im/infra-docs/blob/master/docs/postmortems/20220526_prater_validator_slashing.md
We had an issue where two servers had the same set of key. We didn't notice it, so server A spent most of his time restarting because the DG protection was stopping it.
That continued until server B was also restarted. Then, because both servers were in the DG protection phase, they didn't detect each other, and got slashed.

Don't see a straightforward way to fix this in the code, but maybe we should advise to put exponential backoff in place, instead of restarting instantly

arnetheduck · 2022-05-31T14:23:22Z

Don't see a straightforward way to fix this in the code, but maybe we should advise to put exponential backoff in place, instead of restarting instantly

one way to make this work would be to rely on documented exit codes from the beacon node, such that it can be picked up as "fatal" by restart scripts

it should also be documented in the manual cc @unixpi

jakubgs · 2022-06-01T09:30:27Z

@arnetheduck this is a good idea, we can use the SuccessExitStatus setting in the Systemd service definition:

SuccessExitStatus=

Takes a list of exit status definitions that, when returned by the main service process, will be considered successful termination, in addition to the normal successful exit status 0 and, except for Type=oneshot, the signals SIGHUP, SIGINT, SIGTERM, and SIGPIPE. Exit status definitions can be numeric termination statuses, termination status names, or termination signal names, separated by spaces. See the Process Exit Codes section in systemd.exec(5) for a list of termination status names (for this setting only the part without the "EXIT_" or "EX_" prefix should be used). See signal(7) for a list of signal names.

Note that this setting does not change the mapping between numeric exit statuses and their names, i.e. regardless how this setting is used 0 will still be mapped to "SUCCESS" (and thus typically shown as "0/SUCCESS" in tool outputs) and 1 to "FAILURE" (and thus typically shown as "1/FAILURE"), and so on. It only controls what happens as effect of these exit statuses, and how it propagates to the state of the service as a whole.

This option may appear more than once, in which case the list of successful exit statuses is merged. If the empty string is assigned to this option, the list is reset, all prior assignments of this option will have no effect.

Example 1. A service with the SuccessExitStatus= setting
SuccessExitStatus=TEMPFAIL 250 SIGKILL
Exit status 75 (TEMPFAIL), 250, and the termination signal SIGKILL are considered clean service terminations.

https://www.freedesktop.org/software/systemd/man/systemd.service.html#SuccessExitStatus=

We could use whatever the code is used when doppelganger detection triggers a node exit.

arnetheduck · 2022-06-01T11:51:08Z

https://www.freedesktop.org/software/systemd/man/systemd.exec.html# table 8/9 lists "standard" error codes for many of these situations

jakubgs · 2022-06-01T12:30:43Z

To be fair most exit codes listed there are for specific technical issues, like failing to set CPU affinity or adjusting OOM, but we should at least not use those for something they are not.

zah · 2022-06-09T11:29:35Z

I think the "standard exit codes" tables are useful here only in the sense that it would be good to avoid them. Our custom error code "Failing due to a detected doppelganger" could probably use a value above 1000 to be on the safe side.

To actionable items here are implementing this custom error code and expanding the Nimbus guide with a table for our exit codes.

zah · 2022-06-19T10:38:15Z

This has been addressed by the PR linked above. If Nimbus detects a doppelganger on the network, it will exit with the error code 1031. Users are advised to check for this error code before restarting Nimbus automatically in a supervisor. For example, in systemd, this can be achieved by the following option in the service unit file:

# Don't restart when Doppelganger detection has been activated
RestartPreventExitStatus=1031

The new functionality will be shipped in Nimbus 22.6.0

Otherwise we'd end up in a restart loop that could cause two or more nodes to synchronize and start at the same time, rendering doppelganger detection ineffective. For more details see: status-im/nimbus-eth2#3687 Signed-off-by: Jakub Sokołowski <jakub@status.im>

jakubgs · 2022-06-20T07:07:35Z

I was trying to find an equivalent for OSX Launchd plist files, but the closes I found was:

SuccessfulExit <boolean>

If true, the job will be restarted as long as the program exits and with an exit status of zero. If false, the job will be restarted in the inverse condition. This key implies that "RunAtLoad" is set to true, since the job needs to run at least once before we can get an exit status.

I don't see a way to specify the exit code, so that's a shame. But then again you'd have to be crazy to run OSX on a server...

jakubgs · 2022-06-20T07:09:50Z

No such option exists in winsw config either:
https://github.com/winsw/winsw/blob/v2.11.0/doc/yamlConfigFile.md

zah added this to the v22.6.0 milestone Jun 9, 2022

zah assigned tersec Jun 9, 2022

tersec mentioned this issue Jun 10, 2022

separate non-zero exit code for doppelganger detection #3728

Merged

zah mentioned this issue Jun 10, 2022

Don't restart Nimbus instances when they fail with due to detected doppelgangers status-im/infra-nimbus#105

Closed

zah closed this as completed Jun 19, 2022

tersec mentioned this issue Aug 17, 2022

1031 is not a valid program exit code for posix/systemd #3973

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Menduist commented May 31, 2022

arnetheduck commented May 31, 2022

jakubgs commented Jun 1, 2022

`SuccessExitStatus=`

arnetheduck commented Jun 1, 2022

jakubgs commented Jun 1, 2022

zah commented Jun 9, 2022

zah commented Jun 19, 2022 •

edited

Loading

jakubgs commented Jun 20, 2022

`SuccessfulExit <boolean>`

jakubgs commented Jun 20, 2022

Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Doppleganger protection can fail if nimbus is setup in a always-restart-on-failure scenario #3687

Comments

Menduist commented May 31, 2022

arnetheduck commented May 31, 2022

jakubgs commented Jun 1, 2022

SuccessExitStatus=

arnetheduck commented Jun 1, 2022

jakubgs commented Jun 1, 2022

zah commented Jun 9, 2022

zah commented Jun 19, 2022 • edited Loading

jakubgs commented Jun 20, 2022

SuccessfulExit <boolean>

jakubgs commented Jun 20, 2022

`SuccessExitStatus=`

zah commented Jun 19, 2022 •

edited

Loading

`SuccessfulExit <boolean>`