Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quick validator key swap #11264

Closed
wacban opened this issue May 8, 2024 · 7 comments
Closed

quick validator key swap #11264

wacban opened this issue May 8, 2024 · 7 comments
Assignees

Comments

@wacban
Copy link
Contributor

wacban commented May 8, 2024

Make it possible to quickly move validator keys from one node to another. The goal is to allow validators to perform common maintenance operations with no downtime.

In a typical scenario a validator needs to restart their node. The reason may be a new neard release, need to update configs or any issue that may be causing the node to misbehave. In such circumstance the node operator runs two nodes - the old one with the validator keys and the new one to get it warm and ready. Once the new node is ready the operator stops the old node, moves the validator keys to the new node and restarts the new node. Unfortunately restarting the node may take some time and this will get worse once the memtrie is release. This issue is to allow to move the validator keys from one node to another quickly.

It's important to make sure that no two nodes have the validator keys at the same time as those would both produce blocks and chunks which can be considered as malicious behaviour.

So far the best ideas to implement this are:

  • Allow node to hot load the validator keys upon receiving a signal. We already have a similar solution for some config fields. The operator can start the new node without validator keys, copy them from old node, stop the old node, signal the new node to load the keys.
  • Allow operator to configure some predefined height at which the validator keys will become valid. The operator can start the new node with the validator keys and start height in the future and stop the old node right before that height.
@staffik staffik self-assigned this May 9, 2024
@staffik
Copy link
Contributor

staffik commented May 13, 2024

The plan is to start with:

  • Integration test (pytest / nayduck) that fails.
  • Draft implementation so that the test passes.
  • First round of review (on the approach).

If looks fine then:

  • Adversenet testing.
  • Forknet testing.
  • Final review.

@staffik
Copy link
Contributor

staffik commented May 15, 2024

2024-05-15 (Wednesday) Update

  • Added integration test for the feature, it fails ofc.
  • Loading validator key on SIGHUP, but it is not picked by client yet.

Changes:
staffik/nearcore@master...staffik:bcd14a28bb8cca2a7e005ad8347c458f6bb6cca1

Working on extending existing UpdateableConfigs so that we can update the client with the new validator key.

@wacban
Copy link
Contributor Author

wacban commented May 20, 2024

@saketh-are Is there anything that we need to do on the networking side? For example announce that that a different node has the validator keys?

@saketh-are
Copy link
Collaborator

I don't believe any changes will be needed on the networking side.

Every node in the network has either 1 or 2 public keys associated with it:

  • Every node has a peer id, a public key identifying the node itself
  • Some nodes also have an account key, a public key identifying the validator account being handled by the node

Every node maintains in memory a mapping of the AccountKey -> PeerId relationships it is aware of. There is some logic already implemented which allows the mapping to be updated if a validator key starts to be hosted by a different peer id.

The things to make sure of are:

  • Running two nodes at the same time with the same peer id is not supported behavior at the network layer.
  • As you said, it's also important to make sure that no two nodes operate the same validator account at the same time.
  • If you plan to hotswap onto a node it should be given sufficient time to form network connections; starting up the backup node on-demand will be worse than simply restarting the original node. Ideally the backup node would be running for 10+ minutes already and the operator may wish to confirm that it has a reasonable number of peers before hotswapping onto it.

@staffik
Copy link
Contributor

staffik commented May 21, 2024

2024-05-21 (Tuesday) Update

  • Extended UpdateableConfigs with validator key, so that it can be updated dynamically.
  • Refactored Signer and ValidatorSigner in order to be able to wrap validator key with MutableConfigValue.

Changes:
staffik/nearcore@staffik:bcd14a28bb8cca2a7e005ad8347c458f6bb6cca1...staffik:9ea088938a6d6d4761100bd606de18288a23d107

However, the chain is stuck after the hotswap. Investigating why.

github-merge-queue bot pushed a commit that referenced this issue Jun 10, 2024
Part of: #11264
This PR should be no-op:
- convert `Signer` and `ValidatorSigner` traits into an enum
- wrap `ValidatorSigner` with `MutableConfigValue`

`MutableConfigValue` requires to implement `PartialEq` and `Clone`
traits. These traits are not object safe, thus cannot be derived for
`ValidatorSigner` trait. We need to convert `ValidatorSigner` trait into
an enum, similarly `Signer` trait.
That's also the solution recommended by Rust:
rust-lang/rust#80194

Unfortunately, this change requires a change in enormous number of
places, because the existing code mostly used
`InMemory(Validator)Signer` directly instead of `dyn (Validator)Signer`.
To minimize changes, I added traits like
`From<InMemoryValidatorSigner> for ValidatorSigner` so that it suffices
to add `.into()` in most cases.
github-merge-queue bot pushed a commit that referenced this issue Jun 21, 2024
Issue: #11264

This is follow-up to #11372.
The actual changes (+test) for
#11264 will be done in a third,
final PR: #11536.

### Summary
This PR should mostly be no-op. It focuses on propagating
`MutableConfigValue` for `validator_signer` everywhere.
All instances of mutable `validator_signer` are synchronized.
In case validator_id only is needed, we propagate `validator_signer`
anyway as it contains the current validator info.

### Extra changes
- Remove signer as a field and pass to methods instead: `Doomslug`,
`InfoHelper`, `ChunkValidator`.
- Make some public methods internal where they do not need to be public.
- Split `process_ready_orphan_witnesses_and_clean_old` into two
functions.
- Removed `block_production_started` from `ClientActorInner`.
- Add `FrozenValidatorConfig` to make it possible to return a snapshot
of `ValidatorConfig`.

---------

Co-authored-by: Your Name <you@example.com>
github-merge-queue bot pushed a commit that referenced this issue Jun 24, 2024
Issue: #11264

This PR build on:
* #11372
* #11400

and contains actual changes and test for validator key hot swap.

### Summary
- Extend `UpdateableConfig` with validator key.
- Update client's mutable validator key when we detect it changed in the
updateable config.
- Advertise our new validator key through `advertise_tier1_proxies()`.
- Add integration test for the new behaviour:
- We start with 2 validating nodes (`node0`, `node1`) and 1
non-validating node (`node2`). It is important that the non-validating
node tracks all shards, because we do not know which shard it will track
when we switch validator keys.
  - We copy validator key from `node0` to `node2`.
  - We stop `node0`, then we trigger validator key reload for `node2`.
- Now `node2` is a validator, but it figures out as `node0` because it
copied validator key from `node0`.
- We wait for a couple of epochs and we require that both remaining
nodes progress the chain. Both nodes should be synchronised after a few
epochs.


Test with:
```
cargo build -pneard --features test_features,rosetta_rpc &&
cargo build -pgenesis-populate -prestaked -pnear-test-contracts &&
python3 pytest/tests/sanity/validator_switch_key_quick.py
```

#### Extra changes:
- Use `MutableValidatorSigner` alias instead of
`MutableConfigValue<Option<Arc<ValidatorSigner>>>`
- Return `ConfigUpdaterResult` from config updater.
- Remove (de)serialization derives for `UpdateableConfigs`.
-

---------

Co-authored-by: Your Name <you@example.com>
@staffik
Copy link
Contributor

staffik commented Jun 25, 2024

2024-06-21 (Tuesday) Update
The code is merged, gonna test it on forknet20.

staffik added a commit that referenced this issue Jun 25, 2024
Part of: #11264
This PR should be no-op:
- convert `Signer` and `ValidatorSigner` traits into an enum
- wrap `ValidatorSigner` with `MutableConfigValue`

`MutableConfigValue` requires to implement `PartialEq` and `Clone`
traits. These traits are not object safe, thus cannot be derived for
`ValidatorSigner` trait. We need to convert `ValidatorSigner` trait into
an enum, similarly `Signer` trait.
That's also the solution recommended by Rust:
rust-lang/rust#80194

Unfortunately, this change requires a change in enormous number of
places, because the existing code mostly used
`InMemory(Validator)Signer` directly instead of `dyn (Validator)Signer`.
To minimize changes, I added traits like
`From<InMemoryValidatorSigner> for ValidatorSigner` so that it suffices
to add `.into()` in most cases.
staffik added a commit that referenced this issue Jun 25, 2024
Issue: #11264

This is follow-up to #11372.
The actual changes (+test) for
#11264 will be done in a third,
final PR: #11536.
This PR should mostly be no-op. It focuses on propagating
`MutableConfigValue` for `validator_signer` everywhere.
All instances of mutable `validator_signer` are synchronized.
In case validator_id only is needed, we propagate `validator_signer`
anyway as it contains the current validator info.
- Remove signer as a field and pass to methods instead: `Doomslug`,
`InfoHelper`, `ChunkValidator`.
- Make some public methods internal where they do not need to be public.
- Split `process_ready_orphan_witnesses_and_clean_old` into two
functions.
- Removed `block_production_started` from `ClientActorInner`.
- Add `FrozenValidatorConfig` to make it possible to return a snapshot
of `ValidatorConfig`.

---------

Co-authored-by: Your Name <you@example.com>
staffik added a commit that referenced this issue Jun 25, 2024
Issue: #11264

This PR build on:
* #11372
* #11400

and contains actual changes and test for validator key hot swap.

### Summary
- Extend `UpdateableConfig` with validator key.
- Update client's mutable validator key when we detect it changed in the
updateable config.
- Advertise our new validator key through `advertise_tier1_proxies()`.
- Add integration test for the new behaviour:
- We start with 2 validating nodes (`node0`, `node1`) and 1
non-validating node (`node2`). It is important that the non-validating
node tracks all shards, because we do not know which shard it will track
when we switch validator keys.
  - We copy validator key from `node0` to `node2`.
  - We stop `node0`, then we trigger validator key reload for `node2`.
- Now `node2` is a validator, but it figures out as `node0` because it
copied validator key from `node0`.
- We wait for a couple of epochs and we require that both remaining
nodes progress the chain. Both nodes should be synchronised after a few
epochs.


Test with:
```
cargo build -pneard --features test_features,rosetta_rpc &&
cargo build -pgenesis-populate -prestaked -pnear-test-contracts &&
python3 pytest/tests/sanity/validator_switch_key_quick.py
```

#### Extra changes:
- Use `MutableValidatorSigner` alias instead of
`MutableConfigValue<Option<Arc<ValidatorSigner>>>`
- Return `ConfigUpdaterResult` from config updater.
- Remove (de)serialization derives for `UpdateableConfigs`.
-

---------

Co-authored-by: Your Name <you@example.com>
@staffik
Copy link
Contributor

staffik commented Jul 1, 2024

Confirmed it works fine in forknet20, works with #11689 too.

@staffik staffik closed this as completed Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants