Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented NegativeUNL #3380

Closed
wants to merge 34 commits into from
Closed

Implemented NegativeUNL #3380

wants to merge 34 commits into from

Conversation

pwang200
Copy link
Contributor

The Negative UNL is a protocol change to improve the network’s reliability in the face of failed validators without sacrificing network safety. The Negative UNL protocol uses the existing consensus algorithm for the network to agree on a list of validators that are believed to be unreliable at the moment and lists the validators on the ledgers. If a validator V is on that list, it would still participate in consensus in precisely the same way. But nodes with V in their UNL adjust the quorum and V’s validation message is not counted when verifying if a ledger is fully validated. When V becomes reliable, nodes use the consensus algorithm to vote it off the list.

This feature adds a new pseudo-transaction type and a Negative UNL component to the ledger data structure. Any tools or systems that rely on the format of this data will have to be updated.

This feature will need an amendment to activate.

@MarkusTeufelberger
Copy link
Collaborator

Is there a more technical description available of this, ideally including some mathematics?

@MarkusTeufelberger
Copy link
Collaborator

I'm especially worried about the consensus mechanism starting to bleed into the on-ledger representation with this amendment. If servers are unreliable, a UNL registry should remove them immediately, I don't see why it would be so important to vote (some of) them out collectively after 256 ledgers and create spam objects on-ledger for everyone else. Also I don't see why this should be globally coordinated - if nodes just keep their private nUNL, the effects would be nearly the same with no need to do extra work in consensus on which widely trusted validators are globally seen as unreliable. For this reason I'd really like to see the mathematical models and proofs(!) behind this change.

If the concept itself is seen as necessary for some reason, please at least make it off-ledger. Right now everything on-ledger is agnostic to the consensus mechanism used, this somewhat "violates" the layer separation between consensus mechanism and actual database/ledger/blockchain.

@codecov-io
Copy link

codecov-io commented Apr 28, 2020

Codecov Report

Merging #3380 into develop will increase coverage by 0.19%.
The diff coverage is 93.55%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3380      +/-   ##
===========================================
+ Coverage    70.22%   70.41%   +0.19%     
===========================================
  Files          683      685       +2     
  Lines        54621    54999     +378     
===========================================
+ Hits         38360    38730     +370     
- Misses       16261    16269       +8     
Impacted Files Coverage Δ
src/ripple/app/consensus/RCLConsensus.h 80.00% <ø> (ø)
src/ripple/app/consensus/RCLValidations.h 85.71% <ø> (ø)
src/ripple/app/ledger/Ledger.h 100.00% <ø> (ø)
src/ripple/app/tx/impl/Change.h 100.00% <ø> (ø)
src/ripple/app/tx/impl/InvariantCheck.cpp 99.44% <ø> (ø)
src/ripple/app/tx/impl/applySteps.cpp 87.01% <ø> (+0.32%) ⬆️
src/ripple/protocol/Feature.h 100.00% <ø> (ø)
src/ripple/protocol/Indexes.h 100.00% <ø> (ø)
src/ripple/protocol/SField.h 95.34% <ø> (ø)
src/ripple/protocol/impl/Feature.cpp 91.66% <ø> (ø)
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 023f570...646c722. Read the comment docs.

@pwang200
Copy link
Contributor Author

@MarkusTeufelberger
The engineering spec is added to docs/0001-negative-unl/. Hope that can answer some of your concerns. We can discuss after.

@MarkusTeufelberger
Copy link
Collaborator

I'd still like my question answered why this has to be coordinated through a (voted + recorded) pseudo transaction instead of decided privately by each server instead (if necessary, only every flag ledger instead of every ledger to slow the reaction time down a bit) - or was this not even considered? The doc links to internal data that's not published (yet?), so it is a bit hard to know why David proposed this feature the way it is designed here.

@nbougalis
Copy link
Contributor

Hey @MarkusTeufelberger. First off, apologies that we didn't actually include the spec alongside the commit. It's been added with the latest commit.

The doc links to internal data that's not published (yet?), so it is a bit hard to know why David proposed this feature the way it is designed here.

Ack, that should have been caught; apologies. The document primarily discusses the rationale for "simplifying" the original proposal by limiting the number of changes that can be done at once. Here's the relevant content:

Executive Summary

The initial negative UNL specification had a lot of complexity around handling changes to the negative UNL across flag ledgers. It required some complex special cases and reduced safety significantly around flag ledgers. This updated version does not impact safety though operates a bit more slowly.

Ledger Structures

The negative UNL contains three components. One is the current negative UNL, a list of validator public keys. The second is an entry to be added to the negative UNL on the next flag ledger. The third is an entry to be removed from the negative UNL on the next flag ledger.

Pseudo-Transactions

A new pseudo-transaction, UNLModify, is added. It has a flag indicating whether the modification is an addition or a removal. It also has the public key of the validator to be added or removed. Like other pseudo-transactions, it has the ledger sequence number to ensure its ID is unique.

A UNLModify/add transaction sets the validator to be added at the next flag ledger, overwriting a previous entry (if any). A UNLModify/remove transaction sets the validator to be removed at the next flag ledger, overwriting a previous entry (if any). Note that an add for a validator already in the UNL or a remove for a valdiator not in the UNL is a no-op and does not modify the ledger.

Flag Ledger Logic

When deciding how to vote in a flag ledger, each server will take the negative UNL from preceding ledger, apply any addition or removal specified in the ledger, and then decide how to choose a server to add and a server to remove from the negative UNL.

A validator is chosen to vote to remove from the negative UNL as follows:

  1. The server looks for any validators on the negative UNL that produced validations for the ledger prior to this flag ledger.
  2. If none are found, this server will not vote to remove an entry.
  3. If any are found, the public key is XORed with the hash of the parent ledger and the one with the lowest XOR result is chosen.

A validator is chosen to vote to add to the negative UNL as follows:

  1. The server looks for any validators not on the negative UNL that have not produced any recent valid validations.
  2. If none are found, this server will not vote to add an entry.
  3. If any are found, the public key is XORed with the hash of the parent ledger and the one with the lowest XOR result is chosen.

Before a ledger after a flag ledger’s transactions are executed, the UNL is modified by the add and remove entries in the ledger, if any, and the add/remove entries are removed.

Ledger Validation Logic

There is a three step process to determining what UNL to use when fully-validating a ledger. First, the server must decide what negative UNL to use. Second, the server must combine the negative UNL with its own UNL and determine what the effective UNL and quorum requirement is for that ledger. Lastly, the server must see if it has sufficient validations.

My current recommendation for step one is to simply use the UNL in the ledger. Because at least 60% of the server’s own local UNL is always required to accept a ledger as fully validated, I do not believe it is necessary to confirm that the negative UNL is not too different from that of prior ledgers to protect against malicious validators.

The quorum logic is as follows:

  1. Only validations from validators that are on your local UNL but not on the negative UNL count.

  2. Compute 60% of the number validators on your local UNL (round up).

  3. Compute 80% of the validators on our local UNL that are not on the negative UNL (round up).

  4. The effective quorum requirement is the greater of the numbers you computed in steps 2 and 3.

For example, say there are 32 validators on your UNL, 8 entries on the negative UNL of which 7 are on your UNL. The only validations you count are any from the 32-7 validators on your UNL and not on the negative UNL. So there are 25 candidate validators. The quroum is the greater of 60% of 32 and 80% of 25, both of which are 20. So the required quorum is 20 of 25.

Now, you wrote:

I'm especially worried about the consensus mechanism starting to bleed into the on-ledger representation with this amendment. If servers are unreliable, a UNL registry should remove them immediately

This is certainly reasonable and arguably should be done by all responsible publishers. But no publisher can react in near real-time and the reality is that sometimes servers go down for more than just a couple of minutes and that's not necessarily an indication of unreliability. It's reasonable for a UNL publisher to have a policy that, e.g. an outage exceeding 24 hours triggers removal. But what to do during those 24 hours?

This change aims to target that "window" of time between a validator experiencing an outage and the UNL publisher responding by removing that validtor.

We've previously discussed using the ledger for storing such objects, most recently in a conversation involving storing of manifests. You had similar objections back then and I do agree that they are legitimate and certainly merit consideration. With that said, I'm not sure that I agree that this is an inappropriate usage of the ledger or that the consensus is "bleeding into the on-ledger representation."

I'll follow up on some of your other questions tomorrow. 🖖

Copy link
Contributor

@nbougalis nbougalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick first read. Will follow-up with more thorough comments.

src/ripple/app/consensus/RCLValidations.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/TxFormats.h Outdated Show resolved Hide resolved
src/ripple/protocol/impl/SField.cpp Outdated Show resolved Hide resolved
@MarkusTeufelberger
Copy link
Collaborator

Thanks for the additional background! :-)

@pwang200
Copy link
Contributor Author

@MarkusTeufelberger "why this has to be coordinated through a (voted + recorded) pseudo transaction instead of decided privately by each server instead"
Sorry for the late reply. This is what I think: the voted + recorded approach has two desired properties. First, the negative UNL on the ledger is agreed by the network. As a result, the UNLs of nodes are affected the same way. Hence we don't hurt the UNL overlapping condition in the network. If nodes decide privately, they could disable different validators, that leads to reduced UNL overlapping, which lowers the network safety. Secondly, a ledger can be considered as a synchronization point (or a discrete logical clock time), an on ledger negative UNL will take effect at the same logical clock time. In other word, with an on ledger negative UNL, nodes will use the same negative UNL at the same time.

@MarkusTeufelberger
Copy link
Collaborator

As long as UNLs mostly overlap, removing nodes wouldn't necessarily reduce that overlap. It can also increase it. Also network safety isn't reduced in any way (recorded+ coordinated, private + coordinated or private + uncoordinated implementation), only network liveness.

I already suggested using flag ledgers as synchronization points, there's no need to synchronize with others though. A node could just update its nUNL without consulting others at every 256th ledger (which happens to be coordinated with others anyways, so all nUNLs network wide get updated at the same time - they would just look potentially different, e.g. in case some node has severe routing problems to parts of the network).

I get that with an on-ledger implementation they would all be using the same one - but since 80% of nodes would even need to agree that these nodes are unreliable, this would only capture nodes that have (from a network POV) a catastrophic failure, like being offline or extremely overloaded. If a node on my UNL is "reliably unreliable" for me, it should be put on my nUNL - irrespective of if it is unreliable for someone else. Also I don't see why other nodes should be able to force my node to remove a (from my node's PoV) perfectly fine validating node from its UNL.

Lastly, this leaks out UNL information from validators (as I understand it, a validator would only try to vote out nodes that it has on its private UNL). A possible attack(?) scenario would be to run a validator and switch it off for ~130 ledgers to see which other validators want to vote it out. That way I can know with certainty how widely trusted my validator is across the network, even if these votes never make it into a validated ledger (in case my validator doesn't yet have "UNL dominance", meaning it is on 80+% of UNLs of other "UNL dominant" nodes).

@pwang200
Copy link
Contributor Author

I share your concerns. But without on-ledger negative UNL, if nodes disable validators privately, I cannot see if it's safe. It will be like simply lower the quorum from 80% to a lower percentage (with a delay due to flag ledgers).

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not finished with the review, (I really just looked at the code in NegativeUNLVote). Here's a patch with all the suggested changes for a reference: seelabs@d231943

If you do decide to accept some of the comments I thought it might be easier to grab them from that patch.

I also have some larger issues I need to think about, and I obviously need to look at the other code as well, but I wanted to leave the comments I had now instead of waiting to finish the whole thing.

src/ripple/app/misc/NegativeUNLVote.h Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
@movitto
Copy link
Contributor

movitto commented Apr 30, 2020

I find this feature fascinating though am no expert on Byzantine Consensus so can't comment too much on the theory behind the implementation. A couple things to note though

  • This introduces a 'quasi' on-ledger UNL management mechanism, where nodes are responsible for constructing the list privately off ledger, but UNL entries may be added/removed via an on-ledger mechanism. Obviously there are tradeoffs here but this does result in an inconsistency in UNL management which raises some flags. Would it be possible to extend the concept presented here further and allow nodes to recommend validators to be included in the UNL for non-performance reasons? (expanded my datacenter, recommend a new node spun up in a geographically disparate location, etc)

  • Looking at the code it seems 50% missed validation is the cutoff point, the "low water mark" to determine when a validator is unreliable. I'm curious as to how this figure was derived? Was it through analysis of the network or did it just seem like a good number? If a node is only keeping up with 55% of validations they would still be considered a reliable node. Perhaps the low / high water marks can be voted on as well depending on the performance of the network.

  • Consider calling this feature 'UNL Blacklist' (or perhaps UNBL - Unique Node BlackList) as this would more accurately describe what is actually occurring when certain validators are added to a list to ignore

@MarkusTeufelberger
Copy link
Collaborator

Consider calling this feature 'UNL Blacklist' (or perhaps UNBL - Unique Node BlackList)

How about the "global Synchronized Healthcheck Interval Transgressor list"? :-P

Copy link
Contributor

@carlhua carlhua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff @pwang200 I left some comments and I have one major comment that I would like to ask you to consider. The implementation as it stands performs bulk of the calculation at a single synchronization point - which is at a flag ledger. This increases the node's workload at every 256 ledger. Depending on the amount of work, we can potentially see a jitter in the network due to a spike in work load. Although I doubt its "that much" work, we should generally try to avoid doing lots of work at any single one point. Instead, its good practice to distribute the work over a period of time "naturally". In this case, I would ask to do the actual accumulation of scores and other housekeeping over every ledger and only at flag ledger we would do a determination. I would imagine going forward, as the software evolves only more work would be done at flag ledger and its always good to not overload the amount of work needed at any one concentrated point.

src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
src/ripple/app/consensus/RCLConsensus.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/impl/LedgerMaster.cpp Outdated Show resolved Hide resolved
@carlhua
Copy link
Contributor

carlhua commented Apr 30, 2020

Consider calling this feature 'UNL Blacklist' (or perhaps UNBL - Unique Node BlackList)

How about the "global Synchronized Healthcheck Interval Transgressor list"? :-P

Ah! I think going forward, we should solicit more community feedback during the design/spec writing phase! Would you be interested to review those design artifacts? I am thinking perhaps we can open issues ahead of time to describe what we are doing - or maybe an open repo for improvement proposals. Any thought on this?

@pwang200
Copy link
Contributor Author

@movitto
-- I realized the inconsistency in UNL management as well. Extending the concept to recommend validators is an interesting idea. That leads me thinking about David's Two-layer Consensus post. I personally think recommending validators of the lower layer could be one of the responsibilities of the top layer (or the governing layer). 

-- Yes, so far "it just seems like a good number". I acknowledge the need for analysis of the network. It has not been a priority yet. A targeted scenario of the negative UNL, and an important one, is like several validators go offline during a long weekend. In these kinds of cases, 50% seems a good number. Of course, 60% may seem good too. I will collect data and do some analysis. Though I am not sure if the extra complexity of voting on the water marks are needed. 

-- I feel the word "blacklist" might be misleading. A modification of the original UNL changes two things: who to trust, and who contribute to quorum number calculation and count towards full validation of ledgers. Negative UNL does not change who to trust. So I feel "Blacklist" is kind of too strong. The "global Synchronized Health check Interval Transgressor list" might be more accurate. :-) As @MarkusTeufelberger said.

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2

src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/tx/impl/Change.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Show resolved Hide resolved
src/ripple/app/misc/ValidatorList.h Outdated Show resolved Hide resolved
src/ripple/app/misc/ValidatorList.h Outdated Show resolved Hide resolved
src/ripple/app/misc/ValidatorList.h Outdated Show resolved Hide resolved
src/ripple/app/tx/impl/Change.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@cjcobb23 cjcobb23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick first pass to leave some comments. Good stuff @pwang200 .

src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.h Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Show resolved Hide resolved
src/ripple/app/misc/impl/ValidatorList.cpp Outdated Show resolved Hide resolved
src/ripple/app/tx/impl/Change.cpp Outdated Show resolved Hide resolved
src/ripple/app/tx/impl/Change.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/LedgerFormats.h Show resolved Hide resolved
@@ -282,6 +283,11 @@ Ledger::Ledger(Ledger const& prevLedger, NetClock::time_point closeTime)
info_.closeTime =
prevLedger.info_.closeTime + info_.closeTimeResolution;
}

if (prevLedger.rules().enabled(featureNegativeUNL))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is the best place to call updateNegativeUNL(), with the update happening behind the scenes. Updating the negative unl seems similar to updating the skiplist. updateSkipList() is called here:
https://github.com/ripple/rippled/blob/develop/src/ripple/app/ledger/impl/BuildLedger.cpp#L60
I think updateNegativeUNL() should be called there too. When I use this constructor, I expect the constructed object to have the exact same state map as the previous ledger, with no differences. After that, I modify the state map based on new transations, fee voting, updating the skip list, etc. These modifications happen after the constructor. I think updateNegativeUNL() is a similar modification that should happen after the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"When I use this constructor, I expect the constructed object to have the exact same state map as the previous ledger, with no differences." Is this agreed? @nbougalis @seelabs If so, we must document that.

updateNegativeUNL() is private. I'd like to keep it private if possible. It must be called after the state map is copied and before applying the transactions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should call it here: https://github.com/ripple/rippled/blob/97712107b71a8e2089d2e3fcef9ebf5362951110/src/ripple/app/ledger/impl/BuildLedger.cpp#L49
updating the negative unl is part of building the ledger, so I would call it in that function, before applying the transactions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For correctness, the updateNegativeUNL() function must be called after the state map is copied and before applying the transactions. So the place you suggested is correct, and the current code is also correct as well. In the current code, however, the function is private. I think that is a benefit, because it reduces the chance of misuse.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwang200 @cjcobb23 I understand both points of view. This constructor is for "creating a ledger that follows this one". So updating the negative unl here makes sense. OTOH, we don't do that for the skip list, and they are both similar concepts, so we should treat them the same and update them in the same place. (Side note, I don't necessarily agree that after calling this constructor the state maps need to be the same; this isn't a copy constructor).

My vote is with C.J. The analogy with the skip list is good; we should treat them both the same. The easiest change is to move the neg unl update outside of this constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to move it into buildLedgerImpl, but I agree that there's a bit of smell here. We should revisit this and see if we can cleanup.

src/ripple/protocol/impl/Indexes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/Indexes.cpp Outdated Show resolved Hide resolved
src/ripple/app/ledger/Ledger.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NegativeUNLVote.h Outdated Show resolved Hide resolved
seelabs and others added 26 commits June 25, 2020 10:49
* This was always a stub and enabled no functional changes
Corrects the public_key parameter name in the comment.
See XRPLF/xrpl-dev-portal#854 for context.
 *The tecUNFUNDED code is actively used when attempting to create payment
  channels; the messages incorrectly list it as deprecated. Meanwhile, the
  tecUNFUNDED_ADD code actually is an unused legacy code, dating back to
  when there was a WalletAdd transactor.

 *Engine result messages are not part of the binary format and are
  documented as subject to change without notice, so this should not
  require an amendment nor a new API version.

 *Mark terLAST, terFUNDS_SPENT deprecated
* Fixes XRPLF#3269

* Work on a version 2 of the XRP Network API has begun.
  The new API returns the code `notSynced` in place of
  `noClosed`, `noCurrent`, and `noNetwork`.  And
  `invalidParams` is returned in place of `lgrIdxInvalid`.
* The version 2 API can be specified by adding
  "api_version" = 2 to your json request.  The default
  version remains 1 (if unspecified), except for the command
  line interface which tracks version 2.
* It can be re-enabled by setting ApiMaximumSupportedVersion to 2.
This commit, if merged, adds support to allow multiple indepedent nodes to
produce a binary identical shard for a given range of ledgers. The advantage
is that servers can use content-addressable storage, and can more efficiently
retrieve shards by downloading from multiple peers at once and then verifying
the integrity of a shard by cross-checking its checksum with the checksum
other servers report.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Amendment API Change Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.