[fastx] How to facilitate Bulk Sync between authorities, authorities and full replicas #194
@huitseeker has written on some ingredients we can use to solve this, here: https://docs.google.com/document/d/1LXVNqhHSbY499G-seenTYCPRHXormZLK9BXzMSmG7L4/edit?usp=sharing

Key readings:
Here are my current thoughts on Sync design:
The core of the argument against the way you're implementing the push-pull gossip (and then using it to argue that Byzantine authorities never flood anybody else) is that:
QED: honest authorities must be open to receiving, up to fairness, a stream of transaction hashes claimed to be newly produced by any other authority, including Byzantine ones. Those messages must be accepted as long as they are fairly rate-limited (i.e. a Byzantine authority does not produce more data than any other would), up to a volume at least equal to the volume of transaction hashes produced by the median honest authority over the same period of time. If an honest authority produces transaction data at V bytes per second, the corresponding stream of transaction hashes is a V/n byte-per-second stream, where n is the ratio of transaction size to hash size; honest authorities must therefore be open to receiving V/n bytes per second from a Byzantine authority, without constraining those hashes to match valid or new transactions. As a consequence, considering there are f Byzantine authorities, the amount of ingress that the honest node must accept is V(f/n), which should be sufficient to cause a significant amount of slowdown when f is concretely large (average transaction size: 512 bytes, average hash size: 32 bytes, so n = 16).
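As a back-of-the-envelope sketch of this bound (in Rust; the 100k TPS and f = 3 figures below are hypothetical and only there to make the arithmetic concrete):

```rust
/// Back-of-the-envelope estimate of the hash-stream ingress an honest node
/// must tolerate from Byzantine authorities, following the argument above.
fn byzantine_hash_ingress(v_bytes_per_sec: f64, tx_size: f64, hash_size: f64, f: u32) -> f64 {
    // n is the transaction-size to hash-size ratio (512 / 32 = 16 above).
    let n = tx_size / hash_size;
    // Each authority's hash stream is V / n bytes per second, so f Byzantine
    // authorities can together claim V * (f / n) bytes per second of ingress.
    v_bytes_per_sec * (f as f64) / n
}

fn main() {
    // Hypothetical figures: 100_000 TPS at 512 bytes per transaction, f = 3.
    let v = 100_000.0 * 512.0;
    println!("{:.0} bytes/s of hash ingress", byzantine_hash_ingress(v, 512.0, 32.0, 3));
}
```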
There's quite a bit more on IBLT-based solutions, notably:
Here I discuss one more ingredient we can use for end-of-epoch or state commitment sync, to get an agreed checkpoint with most of the work done asynchronously and a tiny bit synchronously: https://docs.google.com/presentation/d/1TSs1FPC9INZMpipP1vP2dK7XMmOXtHk5CLWFKO6s7p8/edit?usp=sharing
Here's a new design that aims at producing a checkpoint as well, but hints at limited amounts of state synchronization along the way:
In a neighboring PR, @gdanezis was asking:
It's not about the pull-based logic

The notion of "asking once you've seen the commitments", like any pull-based logic, is not a subject of concern. In fact, it's very likely that the commitments themselves, if they point me to finite-size payloads of data that each carry authentication, make it safe for me to ask for unseen data productively. Concretely, once I ask the server using a hash it signed over, it had better deliver me the correct and valid payload, or I'll have the cryptographic assets to reveal it as Byzantine. The issue is that upstream of that, commitments without throughput limits are in themselves a "damned if you do, damned if you don't" logic: if as a node I don't subscribe to enough commitment streams, I'm unlikely to succeed in syncing with the network, and if I do subscribe to Byzantine authorities, I'm likely to be bogged down in unproductive resource consumption (through the commitments alone) until I can determine for sure that the commitment data I'm receiving from a subscribee is not valuable.

Replay attacks

The problem is that determining the value of commitment data itself (without committing to further resource consumption) is very hard, and quickly becomes a cat-and-mouse game of how much ingress throughput a subscribee can steal from the subscriber without revealing it's Byzantine. That's because there's not much I can do with commitments except read them and see if I already have something for them, by which time the damage is done. The typical attack here is not a DDoS, it's a replay attack, where the replay is information that's 💯 valid but known by the Byzantine node not to make the client node progress. In itself, this commitment replay attack is an amplification of the resource consumption of the sync process, and because it occurs at the expense of non-sync activities of the node (e.g. ingesting new transactions), it becomes a global throughput limit on the whole system. One simple heuristic for clamping down on these shenanigans is to say that for a given unit of time, a client must know whether there are interesting new elements to download anywhere in the system after having seen an amount of commitments of size s per server, such that (3f+1)·s is well below the ingress limits of the node. It's of course a heuristic: I'm not saying it's impossible to have a protocol that works without a hard global cap on the throughput of commitments. For example, the cap could be adaptive and be readjusted periodically depending on load, just as the difficulty is for PoW blockchains. It just gets complicated quickly, and needs to account for f actors serving a commitment replay attack at the highest throughput they can get away with. Absolutely zero structure on batch production (no max batch size, no batch frequency), though, is an additional risk because it lets any Byzantine node do whatever it wants with this, and leaves subscribers with only network-based QoS as a defense (evenly dividing the throughput dedicated to commitment download among subscribees). Network-based QoS is possible, but it's not easy to implement, and the implementations are not efficient on the most common network infrastructures today.

How to fit things into a constant?

I don't have the full story, but one thing that helps is this: if a node has recently seen a batch of transactions, a lot of them are probably referring to transactions in the same batch, depending on each other. How likely is that to occur within the same batch?
The Utreexo paper (Figure 2) makes me think this is a pretty frequent case, for Bitcoin at least. The question I'm asking myself is: can we structure the presentation of commitments so that we transmit commitments to useful data more sparsely?
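For what it's worth, the (3f+1)·s budget heuristic above could look something like the sketch below; the 10% figure, struct, and field names are mine and purely illustrative:

```rust
/// Illustrative per-interval commitment budget, following the heuristic above:
/// across all 3f+1 authorities, commitment ingress should stay well below the
/// node's total ingress capacity for that interval.
struct CommitmentBudget {
    f: u64,                   // assumed number of Byzantine authorities
    ingress_limit_bytes: u64, // total ingress budget per unit of time
}

impl CommitmentBudget {
    /// Maximum commitment bytes accepted from any single authority per unit of
    /// time, keeping (3f + 1) * s at (say) no more than 10% of total ingress.
    fn per_authority_cap(&self) -> u64 {
        let committee = 3 * self.f + 1;
        (self.ingress_limit_bytes / 10) / committee
    }
}
```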
I got a few more resources from Martin Kleppmann (martin.kleppmann@cl.cam.ac.uk):

That sounds good! On the problem of exchanging differences between sets with lots of overlap, there has been some interesting work on BFT set reconciliation algorithms:
If you're willing to organise the elements of the set into a hash graph, you can also use a graph reconciliation algorithm, which I've worked on:
Best wishes,
I'm very familiar with Erlay, and have worked on extending and fuzzing Rust bindings for minisketch. They're a nice constant-factor improvement on IBLTs that works along the same principles (this is what I was mentioning to @asonnino), which we have already integrated into our thinking. I'm also very familiar with the hash graph Martin is suggesting (@gdanezis I believe I mentioned it before I joined - you then shared you had attended Martin's presentation in person in 2019), but it works by reconstituting causal broadcast from scratch, and has a prohibitive communication complexity cost.
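This is not minisketch or an IBLT (both of which additionally recover the differing elements from a constant-size sketch), but as a toy illustration of the underlying idea that a small digest can localize set differences so peers only exchange what disagrees, a sketch along these lines might look like this (all names are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BUCKETS: usize = 64;

/// Toy bucketed fingerprint: XOR of element hashes per bucket. Comparing two
/// fingerprints tells us which buckets differ, so peers only need to exchange
/// the contents of the differing buckets rather than the whole set.
fn fingerprint(items: &[Vec<u8>]) -> [u64; BUCKETS] {
    let mut fp = [0u64; BUCKETS];
    for item in items {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let digest = h.finish();
        fp[(digest % BUCKETS as u64) as usize] ^= digest;
    }
    fp
}

/// Buckets whose fingerprints disagree; candidates for a follow-up exchange.
fn differing_buckets(a: &[u64; BUCKETS], b: &[u64; BUCKETS]) -> Vec<usize> {
    (0..BUCKETS).filter(|&i| a[i] != b[i]).collect()
}
```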
It feels like we now have a critical mass of eyes on this issue, as well as design options / ingredients / proposals, so let's try to make a plan for incremental progress, while still thinking about / designing the best mechanisms today and down the line.

Priorities

In my mind we have the following priorities:
I think that so far we (well, I first of all, mea culpa) have interchangeably referred to all of these with variants of sync / synchronization names (the S-word), leading to confusion. These functions are interrelated, and solving one well can support the others. In fact it sometimes feels we are going in circles since (as suggested in the bullet points above) there can be a cyclical dependency between their solutions. But I think it helps to mentally separate these needs, if only to plan their development.

Summary of proposals and match to priority

I think that this is how the different strands of proposals fit into the above priorities:
Design decisions, and their nature

My aim here is to separate design decisions that impact the protocol -- namely those that all authorities / clients need to agree upon for the system to function -- from design decisions that are more flexible, in that authorities and clients can exercise a certain amount of discretion, and as a result we can afford to design the best techniques incrementally from now until testnet/mainnet and beyond. This is an important distinction for decentralised platform protocols: we need to impose the smallest amount of obligations upon all, and maximise the flexibility for peer innovation down the line.

Core Protocol decisions:
Non-core Protocol decisions:
Call to action: @huitseeker, @velvia, @lxfind, @laura-makdah, @sblackshear -- this is my thinking; let's have a meeting this week or the next to ensure we are all on the same page, make a schedule of when / how to make the core protocol decisions above, agree on which decisions should be core and which we can be more flexible about (non-core), and then move forward.
Inverted index:
Then come the hybrid proposals, both based on timestamping, to categorize some state as "behind us", and therefore help both sync (there's some state we no longer have to talk about) and agreement (there's some part of state we're confident we're done with):
There's also IIUC a topology proposal:
Personal note

Here are cross-cutting concerns I have with many of the proposals above, very much including some of mine:
At Fastly, the first way we handled syncing of distributed rate-limiting data was using Bloom filters to detect differences of state. But that ended up being far more expensive than we imagined, and a more naive solution that sends the interesting set on a regular basis was much faster in practice. We did have the advantage of idempotency, though, so merging data that was already known was not an issue. The larger global cache network communicated using a variation of push-pull gossip, which allowed a message to propagate globally very quickly indeed.
@gdanezis thank you for a very comprehensive summary, that really helps. Will try to read through the spanning tree proposal before the meeting tomorrow.
What I'm hoping for from a data architecture perspective is that
I know this is a tangential concern, but just noting it.
BTW @gdanezis, I think the spanning tree checkpointing is brilliant. I like it much more than using modulo math to do n-way set reconciliation. I wonder who keeps track of the graph, though. There is one failure mode I haven't seen discussed much: I suspect that, much more frequently than Byzantine attacks, we will see the scenario where some authority was offline for a few minutes or so, leaving a big "hole" in its data. That hole would need to be filled first to make the O(diff) algorithms efficient.
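A minimal sketch of what detecting such a hole could look like, assuming (my assumption, not anything the proposals prescribe) that each authority's updates carry contiguous sequence numbers:

```rust
/// Given the sorted sequence numbers we hold for one authority, return the
/// missing (inclusive) ranges up to `expected_next`, i.e. the "holes" that
/// must be backfilled before an O(diff) reconciliation is worthwhile.
fn missing_ranges(held: &[u64], expected_next: u64) -> Vec<(u64, u64)> {
    let mut gaps = Vec::new();
    let mut cursor = 0u64;
    for &seq in held {
        if seq > cursor {
            gaps.push((cursor, seq - 1));
        }
        cursor = seq + 1;
    }
    if cursor < expected_next {
        gaps.push((cursor, expected_next - 1));
    }
    gaps
}
```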
Edit from @huitseeker: I backed up this long comment, copy-pasted from the chat bar of one of our discussions on March 1st, here:
State Synchronization discussions summary

George's pre-game summary at this comment is a good setup for where we were (esp. proposals) before the talks on March 1st.

Problems

In a nutshell, we agree we are going to have to solve all three of these problems:
We note that while a Follower can afford to trust its server, the reconciliation and agreement problems have to be designed for a Byzantine environment.

Solutions

Solving problem A is relatively uncontroversial, in that there's only one proposal for going about it (which will benefit from some of the above insight). For problems B & C, we agree that from the moment we know the hash of a new transaction from a certain sender, we know how to safely synchronize everything that is a causal antecedent of it, including all prior transactions for this sender (a rough sketch of this causal walk is included after this summary). So the main obstacle is an exchange of metadata: the hashes of the latest transactions from each account. In solving B & C, we have several categories of proposals:
The challenge in these protocols is to make sure they are parsimonious in their bandwidth usage (by all rights, we should exchange but a tiny amount of data). In the short term, we can focus on just the proposals that require modifications to the core protocol (better engineered early). Only the timestamper approaches have that requirement. The core advantage of these approaches is that they can define rough segments of state that are produced at a limited rate in the first place:
Where this helps reconciliation is that instead of entering an open-ended reconciliation problem over all data from genesis, we can now have sub-protocols where we address timestamp ranges one after the other, and make synchronization incremental. Where this helps agreement is that we can agree over fragments of the past we commit to have seen: if we assume a reconciliation mechanism that ensures delivery within some time, all honest authorities can assemble sets of their certificates that are far older than the time it takes to disseminate them, put these in a state commitment, and try to get agreement on it (2f+1 signatures; a rough sketch of such a range commitment also follows after this summary). We then discussed some details of proposal [3], namely:
At the end of the meeting, we had a positive outlook on integrating some timestamps into the protocol.
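Since this summary leans on the point that knowing one authenticated hash lets us pull its whole causal history, here is a minimal, hedged sketch of that walk. The trait, struct, and field names are illustrative, not the actual FastX types:

```rust
use std::collections::{HashSet, VecDeque};

type TxDigest = [u8; 32];

/// Illustrative certificate payload carrying the digests it depends on
/// (inputs, prior transactions of the same sender, ...).
struct Certificate {
    dependencies: Vec<TxDigest>,
}

/// What we assume we can ask a (possibly Byzantine) server for: the payload
/// behind a digest it committed to, which is verifiable on receipt.
trait CertStore {
    fn fetch(&self, digest: &TxDigest) -> Option<Certificate>;
}

/// Walk backwards from a known digest and collect every causal antecedent we
/// are missing. This only illustrates that one authenticated hash is enough
/// to safely enumerate the data we still need to request.
fn missing_antecedents<S: CertStore>(
    server: &S,
    start: TxDigest,
    already_have: &HashSet<TxDigest>,
) -> Vec<TxDigest> {
    let mut to_visit: VecDeque<TxDigest> = VecDeque::new();
    to_visit.push_back(start);
    let mut seen = HashSet::new();
    let mut missing = Vec::new();
    while let Some(d) = to_visit.pop_front() {
        if !seen.insert(d) || already_have.contains(&d) {
            continue;
        }
        missing.push(d);
        if let Some(cert) = server.fetch(&d) {
            to_visit.extend(cert.dependencies.iter().copied());
        }
    }
    missing
}
```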
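And a minimal sketch of the timestamp-range commitment idea for agreement: group certificate digests by timestamp range, commit to each range deterministically, and treat a range as agreed once 2f+1 authorities sign the same commitment. This assumes the sha2 crate and picks arbitrary encodings; it is an illustration, not the proposal itself:

```rust
use sha2::{Digest, Sha256};

type CertDigest = [u8; 32];

/// Illustrative commitment over one timestamp range [range_start, range_end):
/// a deterministic hash over the sorted, de-duplicated certificate digests
/// observed in that range.
fn range_commitment(range_start: u64, range_end: u64, mut digests: Vec<CertDigest>) -> [u8; 32] {
    digests.sort();
    digests.dedup();
    let mut hasher = Sha256::new();
    hasher.update(range_start.to_le_bytes());
    hasher.update(range_end.to_le_bytes());
    for d in &digests {
        hasher.update(d);
    }
    hasher.finalize().into()
}

/// A range commitment counts as agreed once 2f+1 authorities have signed it.
fn is_agreed(signer_count: usize, f: usize) -> bool {
    signer_count >= 2 * f + 1
}
```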
This is the background discussion for issue #1099.
Authorities need reassurance, at some point, that they have ingested all updates from other authorities up to a given point. Similarly, full replicas (i.e. nodes that replicate all state but do not yet validate) also need to track the full state of the system, and to do so need to know they have received all updates from authorities up to a point. This design task is about designing that mechanism.