
Allow arbitrary validator set changes in DHB. #339

Merged (3 commits), Nov 18, 2018
Conversation

@afck (Collaborator) commented Nov 14, 2018

This replaces NodeChange with a full list of IDs and public keys, instead of just a single to-be-added or to-be-removed node, to allow completely replacing the set of validators by any arbitrary new set in a single key generation step.

Closes #330.
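To make the shape of the change concrete, here is a minimal sketch of the reworked enum, assuming a BTreeMap of node IDs to public keys; the PublicKey alias is a stand-in for hbbft's actual key type, and the real enum may carry more variants:

```rust
use std::collections::BTreeMap;

// Stand-in for hbbft's threshold public key type (illustrative only).
type PublicKey = Vec<u8>;

/// A vote for a change of the validator set (sketch, not hbbft's exact code).
pub enum Change<N: Ord> {
    /// Change the set of validators to the one in the provided map. There are
    /// no restrictions on the new set; it can even be disjoint with the
    /// current set of validators.
    NodeChange(BTreeMap<N, PublicKey>),
}
```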

@afck requested review from mbr, c0gent and vkomenda, Nov 14, 2018 11:39
@afck (Collaborator, Author) commented Nov 14, 2018

I prepared the hydrabadger update here: https://github.com/poanetwork/hydrabadger/tree/afck-change

@vkomenda (Contributor) left a comment:

A landmark change! A few remarks.

```diff
-    // Add or Remove a node from the set of validators
-    NodeChange(NodeChange<N>),
+pub enum Change<N: Ord> {
+    /// Add or remove node from the set of validators. Contains the full new set of validators.
```
@vkomenda (Contributor):

Let's not shy away from the scale of the change! I suggest changing the comment as follows:
"Change the set of validators to the one in the provided map. There are no restrictions on the new set of validators. In particular, it can be disjoint with the current set of validators."

```diff
-    /// all nodes.
-    fn added_node(&self) -> Option<N>;
+    /// Returns the new set of validators that this batch is starting key generation for, including
+    /// the existing ones. These should be added to the set of all nodes.
```
@vkomenda (Contributor):

I think it is a bit misleading to say that the new nodes are added to "all nodes". Do you mean, rather, that all old peer_epochs entries should remain monitored and new entries should be created for each new node?

BTW, we don't have an upper limit on the size of peer_epochs, which matters for (probably very sophisticated) spam control.

@afck (Collaborator, Author):

Good point… that's a more general issue that I'm still unclear about:
Do we want the sender queue to automatically manage the set of observer nodes, or should the user also have some control over that (e.g. to add an observer that isn't about to join as a validator)?

  • In the first case, we'd probably also want to remove peers on ChangeState::Complete, in addition to adding them on ChangeState::InProgress, so that the set of peers is always exactly the union of the current and the next set of validators.
  • In the second case, I'm not sure: Either we'd remove this feature completely and require the user to add and remove observers, or we'd maintain an additional list of observers in SenderQueue (those would be the ones that DHB will never know about; they only observe, without ever becoming a validator).

For now, maybe I should actually implement the first point right away? It probably wouldn't be much work and I think it's mainly the intended behavior of the current code?
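For reference, a minimal sketch of the first option under discussion, assuming the peer set is recomputed whenever the change state moves; managed_peers and both parameter names are illustrative, not hbbft's actual API:

```rust
use std::collections::BTreeSet;

/// Keeps the peer set equal to the union of the current validators and,
/// while key generation is in progress, the next validators (sketch only).
fn managed_peers<N: Ord + Clone>(
    current_validators: &BTreeSet<N>,
    // `Some` between ChangeState::InProgress and ChangeState::Complete.
    next_validators: Option<&BTreeSet<N>>,
) -> BTreeSet<N> {
    let mut peers = current_validators.clone();
    if let Some(next) = next_validators {
        peers.extend(next.iter().cloned());
    }
    peers
}
```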

@vkomenda (Contributor):

Our sender queue is a reference implementation which, I think, should be as simple as possible for that reason. I tend to choose the second option. In principle, the user can fully control the peers, and possibly whether or not new peers join as observers only. Removing old peers, however, should be totally fine in our implementation. The user can change that in their own implementation.

@afck (Collaborator, Author):

I just realized that removing peers is actually more complicated than I thought:
If we remove a peer in epoch 42, we'll still need to deliver their messages from epochs up to 41. I think that will require some changes to update_epoch, but I'm not entirely sure. Maybe we should do that as a separate PR after all?

@vkomenda (Contributor), Nov 14, 2018:

Of course, no rush with that. It might be doable with a list of "pending actions", e.g.,

  1. "suspend node id after epoch 41", followed by

  2. "remove the suspended node id as soon as all its messages are sent".

Maybe this can be done without composing actions, but I think there are still temporal dependencies that may require removal in steps.
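A minimal sketch of such composed actions; the Epoch alias, the PendingAction type and both variant names are hypothetical, not part of hbbft:

```rust
type Epoch = u64;

/// Two-step removal of a peer, so that messages already queued for it are
/// still delivered before it is dropped (sketch only).
enum PendingAction<N> {
    /// Stop scheduling new messages for the node after the given epoch.
    SuspendAfter(N, Epoch),
    /// Drop the suspended node once every message queued for it up to that
    /// epoch has been sent.
    RemoveWhenDrained(N),
}
```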

@afck (Collaborator, Author):

Yes, I think that would work. I created #341 for it.

```diff
@@ -10,8 +10,8 @@ where
     C: Contribution,
     N: NodeIdT + Rand,
 {
-    fn added_node(&self) -> Option<N> {
-        None
+    fn new_nodes(&self) -> Vec<N> {
```
@vkomenda (Contributor):

Maybe added_nodes or, better matching, added_peers?

@afck (Collaborator, Author) commented Nov 14, 2018

Looks like I broke test_dynamic_honey_badger_first_delivery_silent… investigating… 😕

@afck (Collaborator, Author) commented Nov 14, 2018

OK, I found the problem. It's just the "first delivery" thing again, plus the fact that I effectively increased the required threshold for key generation (which is the right thing to do, I think): in the test, nodes 0 to 5 just keep producing batches and the others never get to send a message.
I'll try to fix the test tomorrow.

Commit: Make sure every node eventually gets to handle its messages.
@afck (Collaborator, Author) commented Nov 15, 2018

Please take another look; the "first" delivery tests now choose a random node in every 10th step.

```rust
            .filter(|(_, node)| !node.queue.is_empty())
            .map(|(id, _)| id.clone());
        let rand_node = match *self {
            MessageScheduler::First => rand::thread_rng().gen_weighted_bool(10),
```
@vkomenda (Contributor):

What if the user implements "first", non-randomized delivery? Wouldn't DHB fail then?

@afck (Collaborator, Author):

If the adversary can do that, Honey Badger fails anyway.
The assumption is that messages from a correct node to a correct node always eventually arrive. The previous "first delivery" implementation didn't satisfy that.

@vkomenda (Contributor):

My point is that previously First used to work. Is there a fix to the algorithm that doesn't require changing the test?

@afck (Collaborator, Author), Nov 15, 2018:

Yes: what broke it was that in is_ready I'm now requiring N - f complete Parts, which is what we should always expect to be able to reach. I could revert that, but I think it's safer this way. And I think it just exposed what was wrong with the test in the first place.

@afck (Collaborator, Author):

The reason I changed it in this PR is that before, the criterion was: f + 1 complete Parts, and the candidate's Part must be complete.

Now that the candidates could potentially be all of the new nodes, that criterion would actually be too strong. So I changed it to the strongest criterion that we can always expect to be satisfied, under the assumptions the algorithms are making anyway.
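A minimal sketch of the new criterion, with illustrative names (the real check lives in hbbft's key generation code and takes more context into account):

```rust
/// Key generation is ready once at least N - f Parts are complete, where
/// f is the maximum number of faulty nodes tolerated (N >= 3f + 1).
fn is_ready(complete_parts: usize, num_nodes: usize) -> bool {
    let max_faulty = (num_nodes - 1) / 3; // f
    complete_parts >= num_nodes - max_faulty
}
```

Since at most f nodes are faulty, N - f complete Parts is always eventually reachable, whereas requiring a specific candidate's Part to be complete could block forever if that candidate is faulty.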

@afck (Collaborator, Author):

In the real world the "first delivery" scenario is unrealistic, and corresponds to a network that would completely block the connection between C and D as long as there are messages sent from A to B. The requirement has not changed:

The adversary is allowed to control the network as long as every message from a correct node to a correct node eventually arrives.

This requirement was not satisfied by the "first" schedule. That's why the test was wrong, and we were just "lucky" that it passed so far.

@vkomenda (Contributor):

I agree in general. Now you essentially have two randomized delivery strategies. I think we still need a deterministic one, round-robin for example. I'm not sure what priority that would have, given that we are moving all tests to net.

@afck (Collaborator, Author):

I agree; in the long run it would be good to have several different "extreme" delivery strategies (while still satisfying the above requirement). We could easily make this one deterministic, too, by doing round-robin instead of randomizing in every 10th round.
But I didn't want to spend too much time implementing delivery strategies for the old test network.
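A minimal sketch of that deterministic variant, keeping the same "every 10th step" cadence; the RoundRobin type and next_special_node are hypothetical, not part of the test framework:

```rust
/// Deterministic replacement for the randomized pick: every 10th step,
/// deliver to the next node in round-robin order instead of the default
/// "first" choice (sketch only).
struct RoundRobin {
    step: usize,
}

impl RoundRobin {
    fn next_special_node(&mut self, node_ids: &[usize]) -> Option<usize> {
        self.step += 1;
        if node_ids.is_empty() || self.step % 10 != 0 {
            return None; // fall back to the usual "first" delivery
        }
        Some(node_ids[(self.step / 10) % node_ids.len()])
    }
}
```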

@vkomenda (Contributor):

I'm getting more convinced about this change, but I still think that the new validator set feature is not explored enough to be deemed safe, and moreover the old test failed on it. This is too big for a single-person review. @c0gent, @mbr, please check if you are happy with @afck's "real world" argument.

Buyer beware: adversaries will stretch the real-world boundaries.

@afck (Collaborator, Author):

Note that, as a mathematician, I don't care for "real world arguments" myself! 😎
However, in this case, it's also technically correct! The "first" delivery strategy simply didn't meet the theoretical requirement.

@c0gent (Contributor) left a comment:

This and the Hydrabadger changes look good to me. I'll address the whole node removal/reconnection aspect soon as well.
