Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failover script for a backed up node #41

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

Conversation

vkomenda
Copy link
Contributor

See #39. This PR doesn't contain any tests for the failover script.

@vkomenda vkomenda requested review from phahulin and varasev March 19, 2019 14:03
@phahulin phahulin self-requested a review March 21, 2019 16:54
package.json Outdated Show resolved Hide resolved
scripts/watchBackedUpNode.js Outdated Show resolved Hide resolved
scripts/watchBackedUpNode.js Outdated Show resolved Hide resolved
@phahulin phahulin self-requested a review March 22, 2019 11:27
@vkomenda vkomenda force-pushed the vk-watchguard-script branch from babb659 to aa87f0c Compare March 25, 2019 14:26
@vkomenda vkomenda force-pushed the vk-watchguard-script branch from d3e57ab to 8196113 Compare March 27, 2019 11:32
@phahulin
Copy link
Contributor

Seems to be working correctly 👍
But this is not the final script that we want to have for #39, because it connects to localhosts. Though I think it's ok to merge this PR as a proof-of-concept and then have the final script in a separate pull request.
Or we could continue to update this PR and get a final version here. What do you think @vkomenda ?

@vkomenda
Copy link
Contributor Author

Can it call the secondary node locally and the primary node remotely? In that case we can just change the primary node HTTP URL to stop calling it through localhost. That is assuming it is OK to run the script on the secondary node.

@phahulin
Copy link
Contributor

I don't think it can call primary, it's best to close all ports except SSH on validator node for security reasons.
So probably only scanBlocks can be used to verify if primary is down.

@phahulin
Copy link
Contributor

In such a case it seems like we'll have to have similar scripts on both primary and secondary

@varasev
Copy link
Contributor

varasev commented Mar 28, 2019

Could we add an extra RPC call like parity_clearEngineSigner into Parity code which would remove the engine_signer option instead of switching it to DUMMY_SIGNER_ADDRESS. This way a node wouldn't have to have the dummy address. Would it be difficult?

@vkomenda
Copy link
Contributor Author

@phahulin That's a good plan. Would you like me to continue with that in this PR or merge this and open another for splitting the failover script in two?

@varasev It's not difficult, I think, to add parity_clearEngineSigner and it should work better than a dummy signer address. I'll do that.

@phahulin
Copy link
Contributor

IMO it's best to get a final script in this PR

@vkomenda
Copy link
Contributor Author

vkomenda commented Apr 1, 2019

So probably only scanBlocks can be used to verify if primary is down.

OK. In that case how do you check whether the primary is back up?

@phahulin
Copy link
Contributor

phahulin commented Apr 1, 2019

Maybe we could have on-chain storage for that in a special smart contract.
The contract would contain a mapping address => uint256, i.e. mining key => node unique non-zero id. All nodes with the same mining key should have different ids.
When script on one of the nodes detects that validator is missing blocks, it would send a tx to that smart contract and replace current value of id with it's own id (unless current id matches it's own id), wait for tx to be mined and then switch engine signer.
When scripts on other nodes detect that id has changed, they would clear engine signer on their nodes.
In this case we don't have distinction between primaries/secondaries

What do you think? Also cc @varasev

@vkomenda
Copy link
Contributor Author

vkomenda commented Apr 1, 2019

OK. Let's use smart contract storage. Does this now fit into the old notion of benign misbehaviour? If a miner address failed to produce expected blocks,

  1. should this be reported (rather than simply logged locally) as benign misbehaviour and

  2. should there be an event MisbehaviourHandledBy(replica_id) that leads to the change of the value at the mining address in the above mapping?

@varasev
Copy link
Contributor

varasev commented Apr 1, 2019

Does this now fit into the old notion of benign misbehaviour?

We can't use the reportBenign because this is difficult as in case with the reportMalicious: poanetwork/parity-ethereum#97 - it would require a lot of changes like in poanetwork/parity-ethereum#107 (we didn't test it yet).

Maybe we could have on-chain storage for that in a special smart contract.

It is a possible solution if the validator has some switcher address (or wather address?) bound with the mining address (like staking address). The separate switcher address is needed to have a separate nonce (because of that nonce issue with the mining address).

So, when some candidate create their pool, they will have to have three addresses:

  • mining address
  • staking address
  • switcher address

The switcher address would be allowed to send only the described transaction (to change the node id in the mapping) with zero gas price (this would be achieved by using TxPermission contract).

This approach would require several changes in the contracts. But I think of the next solution:

it's best to close all ports except SSH on validator node for security reasons.

I think, ideally, we could open one TCP port on each node (say, 2019) and set firewall rules so that the inbound traffic would only be allowed between the two nodes since we know their IP addresses and they are constant, I guess.

For example, we have two nodes for the same validator:

Node A (IP = 192.168.10.101, engine_signer = mining address)
Node B (IP = 192.168.10.102, engine_signer is not set)

Each node has the watchguard script listening on port 2019.

The firewall on the node A allows inbound connections on the port 22 and connection from the remote 192.168.10.102:2019 to the local port 2019.

The firewall on the node B allows inbound connections on the port 22 and connection from the remote 192.168.10.101:2019 to the local port 2019.

That way the scripts on the nodes would connect to each other, and other unwanted inbound connections are restricted (except SSH).

This would be simpler than the scheme with node's id and the mapping in a contract.

@vkomenda
Copy link
Contributor Author

vkomenda commented Apr 9, 2019

Ready for review. All tests pass for me.

@phahulin
Copy link
Contributor

For me this test fails
I also double-checked with jsonrpc call and it actually returns true for the secondary (and for the primary)

➜  curl --data '{"method":"eth_mining","params":[],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8544
{"jsonrpc":"2.0","result":true,"id":1}

@vkomenda vkomenda force-pushed the vk-watchguard-script branch from 258e2ac to 7228fef Compare May 3, 2019 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants