Lecture Given by Lindsey Kuper on April 20th, 2020 via YouTube
Previous | Next |
---|---|
Lecture 9 | Lecture 11 |
The trouble with distributed system design is that, in order to be useful, a system needs both types of property to be present; however, the open-ended nature of liveness properties makes them hard to reason about.
Let's say we want to implement a protocol that satisfies the safety property of FIFO delivery, but doesn't need to care about any liveness properties. So, we could build a system that either drops or ignores every message...
This is completely useless, but at least it guarantees FIFO delivery! (Doh!)
So, it turns out that on their own, neither safety nor liveness properties are of much practical use. We need a system that implements a combination of both.
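To make this concrete, here's a minimal sketch in Python (the class and method names are my own, purely illustrative) of a "protocol" that drops every incoming message. It vacuously satisfies FIFO delivery, since nothing is ever delivered out of order, yet it provides no liveness whatsoever:

```python
class DropEverythingReceiver:
    """A 'protocol' that trivially satisfies FIFO delivery (a safety
    property) by never delivering any message at all."""

    def __init__(self):
        self.delivered = []  # messages delivered to the application, in order

    def on_receive(self, sender, seq_no, payload):
        # Drop the message. Since nothing is ever delivered, no two messages
        # can ever be delivered out of FIFO order, so the safety property
        # holds vacuously...
        pass

    # ...but the liveness property "every message sent is eventually
    # delivered" is never satisfied, which is what makes this useless.
```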
Other than the "eventual delivery" property we've already spoken of, the only other liveness property we've mentioned is "reliable delivery". Its definition varies depending on who you ask, but here's one definition:
Let P1 be a process that sends a message m to process P2.
If neither P1 nor P2 crashes, then P2 eventually delivers message m.
However, you might also see an extended version of the above definition:
Let P1 be a process that sends a message m to process P2.
If neither P1 nor P2 crashes, and not all messages are lost, then P2 eventually delivers message m.
These definitions vary only in respect of whether or not we should be concerned about message loss. This question then leads us into the topic of fault models that we will come to shortly. However, here we assert that reliable delivery is a liveness property.
But why is this?
Firstly, we can see that the idea of reliable delivery cannot be violated in a finite execution: that is, in a finite execution, we cannot draw a Lamport diagram that demonstrates a violating counter-example.
So, since we know that all the properties we care about are either safety properties, liveness properties or some combination of both, and we know that "reliable delivery" is not a safety property, we know it must therefore be a liveness property.
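One way to see the difference is that a safety property like FIFO delivery can be refuted by inspecting a finite log of deliveries, whereas no finite log can refute reliable delivery, since the missing delivery might still happen later. Here's a small sketch; the (sender, sequence-number) log format is a hypothetical one I've assumed just for illustration:

```python
def violates_fifo(delivery_log):
    """Return True if a finite delivery log refutes FIFO delivery.

    delivery_log is a list of (sender, seq_no) pairs in the order in which
    one process delivered them. A finite log can serve as a counter-example
    to a safety property; no finite log can refute a liveness property like
    reliable delivery, because the awaited delivery may still lie in the
    future.
    """
    highest_seen = {}
    for sender, seq_no in delivery_log:
        if seq_no < highest_seen.get(sender, 0):
            return True  # an earlier message was delivered after a later one
        highest_seen[sender] = seq_no
    return False

# A finite counter-example refuting FIFO delivery:
print(violates_fifo([("P1", 1), ("P1", 3), ("P1", 2)]))  # True
```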
Any time you design a system, you need to make certain assumptions about the environment in which that system will be operating. So, before you can describe your system as being fault tolerant, you must first clearly define what circumstances could occur that constitute a fault, and secondly, how your system will respond under such circumstances. Only then will you understand what conditions your system needs to tolerate.
In other words, we must start with a clear understanding of what exactly constitutes a fault.
In a simple scenario like the one below, machine M1 asks machine M2 the question "What's the value of x?" and expects to hear something back such as x=5.
What sort of faults could occur here?
- M2 never receives the message because it was lost due to a communication fault
- M2 gets the message, but is unable to answer it because the message became corrupted
- M2 gets the message, but only after a significant delay
- M2 does not respond because it has already crashed
- M2 crashes as a result of trying to answer M1's question
- M2 is too busy to answer the question
- M2 ignores the message
- M2 deliberately responds with the wrong answer
- M2 sends a response back to M1 which gets lost, corrupted or delayed
- M2 sends a response back to M1, but then M1 crashes as it tries to process the answer
- etc...
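Before categorising these faults, it's worth noticing that from M1's point of view many of them are indistinguishable. As a rough sketch (assuming a UDP exchange and a hypothetical address for M2, neither of which comes from the lecture), a lost request, a crashed M2, a slow M2 and an ignored message all look exactly the same to M1, namely a timeout:

```python
import socket

def ask_for_x(m2_address=("m2.example.org", 9000), timeout_secs=2.0):
    """M1 asks M2 for the value of x over UDP and waits for a reply.

    If no reply arrives within the timeout, M1 cannot tell whether the
    request was lost, M2 crashed, M2 is slow, or M2 simply ignored the
    message: all of those faults produce the same observable outcome here.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout_secs)
    try:
        sock.sendto(b"What's the value of x?", m2_address)
        reply, _ = sock.recvfrom(1024)
        return reply.decode()   # e.g. "x=5"
    except socket.timeout:
        return None             # some fault occurred, but which one?
    finally:
        sock.close()
```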
The problem here is that even in this minimal scenario, we cannot enumerate all the possible errors that might occur; however, we can categorise them.
Let's arrange these types of fault into different informal categories.
Fault Category | Description |
---|---|
Crash | Software execution halts or hardware fails |
Omission | Messages are sent, but not received (i.e. lost) |
Timing | Messages are sent and processed successfully, but very slowly (also known as a performance fault) |
Byzantine1 | Intentionally malicious or arbitrary behaviour |
These fault categories have been used in distributed systems design since about the early 1990s and are presented above in hierarchical order, starting with the lowest level first.
At the bottom of this hierarchy is the crash fault. This simply means that a process has stopped exchanging messages with the other participants in the system.
A crash fault can happen for a variety of reasons: for instance, execution could halt due to a software failure, or the process might continue to handle its own internal messages but cease responding to external messages. Whilst this second case is not due to software execution halting, as far as the other participants in the system are concerned, since this process takes no further part in the overall operation of the system, it may as well have crashed.
An omission fault occurs when a process continues execution but, for whatever reason, fails to send or receive some, but not all, messages.
A timing fault occurs when a process responds either too late or too early. The typical case is a process responding too slowly; however, at the hardware level, it is possible for a response to arrive too fast.
In this course, we're going to sidestep the discussion of timing faults because we will be focussing on the asynchronous message delivery model — which never makes any delivery time guarantees in the first place!
A Byzantine fault occurs when a process behaves in an arbitrarily erroneous, or possibly even malicious, way. Usually, such behaviour is intentional. Again, in this course we won't be spending much time talking about this type of fault.
Why are the fault categories listed in this particular order?
Let's say we have two different protocols, X and Y.
 | Protocol X | Protocol Y |
---|---|---|
Tolerates | Crash Faults | Omission Faults |
If Protocol Y tolerates omission faults, does it also tolerate crash faults?
If so, why?
If a process crashes, then from that point on it sends no messages and every message sent to it goes unreceived. A crash is therefore a special case of an omission fault: one in which the process fails to send or receive every message, rather than just some of them.
Hence if process P1 can tolerate process P2 failing to send or receive some messages (an omission fault), then by definition, P1 must also be able to tolerate the case where P2 fails to send or receive all messages (a crash fault).
The set of crash faults forms a proper subset of the set of omission faults.
In the same way that crash faults are a special case of omission faults, omission faults are a special case of timing faults.
A timing fault is one where the response to a message is received outside an acceptable timeframe (usually, too slowly), whereas an omission fault is one where the response to a message is never received (i.e. it is infinitely slow).
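One way to picture this nesting is as a fault-injection knob on a simulated channel. The sketch below is my own illustration, not something from the lecture: a timing fault delays delivery, an omission fault drops some messages (a delay of infinity), and a crash drops every message from some point onwards.

```python
import random

class FaultyChannel:
    """A simulated channel illustrating the fault hierarchy:
    crash faults are a subset of omission faults, which are a subset
    of timing faults."""

    def __init__(self, drop_probability=0.0, max_delay=0.0, crashed=False):
        self.drop_probability = drop_probability  # omission faults
        self.max_delay = max_delay                # timing faults
        self.crashed = crashed                    # crash faults

    def send(self, message):
        """Return (delay_in_seconds, message), or None if the message is lost."""
        if self.crashed:
            return None                     # crash: every message is lost
        if random.random() < self.drop_probability:
            return None                     # omission: some messages are lost
        return (random.uniform(0, self.max_delay), message)  # timing: delivered late

# A crash fault is just the extreme omission fault (drop_probability=1.0),
# and an omission fault is just the extreme timing fault (an infinite delay).
crashed_m2 = FaultyChannel(crashed=True)
lossy_m2 = FaultyChannel(drop_probability=0.3)
slow_m2 = FaultyChannel(max_delay=10.0)
```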
A Byzantine fault is where a process behaves in an arbitrary or intentionally malicious way. So, if a protocol is tolerant of Byzantine faults, does this mean it must also tolerate timing faults?
Yes, it does.
The reason is that if a process B (for Byzantine) starts behaving in an arbitrary or malicious manner, all other processes communicating with B will be unable to distinguish faults in B caused by timing errors from faults in B caused by arbitrary or malicious behaviour.
Consequently, if process B decides to say:
- "I'm going to pretend to crash", or
- "I'm not going to respond to certain messages from certain other processes", or
- "I'm deliberately going to mislead two processes by swapping my responses around",
then any process communicating with B will be unable to determine the true cause of such behaviour.
My Aside
This is because computers are unable to determine the intent of, or motive behind, an action; their inferences can only be made on the mechanistic basis of cause and effect.
The list of possible fault behaviours here is endless...
For more details, see the paper "Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement" by Cristian et al.
Given the fact that we will confine our discussion to systems that use asynchronous communication, such systems cannot give any guarantees about message delivery times; therefore, for the remainder of this course, discussion around fault categories will look more like this:
These fault categories can be subdivided, however; some Byzantine faults are easier to deal with than others. For instance, a Byzantine fault that is relatively easy to deal with is message alteration or duplication.
If you receive a message that has been altered (maliciously or otherwise), then such faults are generally fairly easy to detect using techniques such as checksums or message hashes. What's much harder to detect is message corruption where that corruption is so subtle that your authentication technique fails to detect it.2
So, we can redraw the above diagram as follows:
If we detect that a message has been corrupted, the simplest way of handling it is to ignore it completely. In this case, we have handled the message in the same way as if we had never received it in the first place. Thus, we have downgraded this particular type of Byzantine fault to an Omission fault.
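As a sketch of that downgrade (using a plain SHA-256 digest; the lecture doesn't prescribe any particular technique, and the envelope format here is invented), the sender attaches a hash to each message and the receiver silently drops anything whose hash doesn't match:

```python
import hashlib
import json

def wrap(payload: bytes) -> bytes:
    """Sender side: attach a SHA-256 digest to the payload."""
    digest = hashlib.sha256(payload).hexdigest()
    return json.dumps({"digest": digest, "payload": payload.decode()}).encode()

def unwrap(wire_bytes: bytes):
    """Receiver side: return the payload, or None if the message is corrupted.

    Dropping a corrupted message turns this particular Byzantine fault
    (message alteration) into an omission fault: it is handled exactly as
    if the message had never arrived.
    """
    try:
        envelope = json.loads(wire_bytes)
        payload = envelope["payload"].encode()
        if hashlib.sha256(payload).hexdigest() != envelope["digest"]:
            return None   # digest mismatch: ignore the message
        return payload
    except (ValueError, KeyError):
        return None       # malformed envelope: ignore it as well
```

Note that a plain digest only catches accidental corruption; detecting deliberate tampering would require a keyed MAC or a signature, which is exactly the kind of authentication technique endnote 2 leaves out of scope.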
Based on the preceding discussion we can define a Fault Model as:
The specification of faults that a system may exhibit, which in turn defines the kind of faults which will be tolerated by the system.
Fault models are nested in the same way as fault categories.
In general, the easiest faults to tolerate are the ones in the centre of the diagram. As we move outwards, fault model implementation becomes increasingly complicated simply because a greater number and type of bad things can happen.
In this class, we will be looking mostly at the Omission Model. This will still be hard to implement however, because there are still plenty of things that can go wrong!
This is a classic thought experiment dating back to 1975; it appears in the appendix of a paper by Akkoyunlu, Ekanadham and Huber called "Some Constraints and Trade-offs in the Design of Network Communications".
In this situation, two armies commanded by Generals Alice and Bob want to attack a common enemy; however, several significant challenges must be overcome before victory can be assured:
- Alice's and Bob's armies are camped in two valleys on opposite sides of their common enemy
- The enemy is camped in the valley in between Alice and Bob
- The enemy has sufficient strength that victory cannot be assured unless Alice and Bob attack simultaneously
- Alice and Bob can only communicate with each other by sending messengers through enemy territory
- Alice and Bob cannot be sure that any message they send will get through, because the messenger risks being captured, in which case the message will never arrive
What form of communication can these Generals have with each other so that they can both be confident that they are acting on some shared, common knowledge? For instance, if the plan is to attack tomorrow at dawn, then how can both Generals be sure that the other has firstly agreed to this plan, and secondly, is ready to act on it?
Since the armies of Alice and Bob are not strong enough on their own to defeat this enemy, victory is only possible if they arrive at a commonly agreed plan of action
Let's look at a few scenarios:
- Alice sends Bob the message "Attack at dawn!".
  Q: Having sent this message, should Alice simply go ahead and attack at dawn?
  A: No, because she has no way of knowing that Bob has even received the message, let alone agreed to it. If Alice attacks on her own, there is a significant chance of defeat, so she should wait for Bob's confirmation that he has agreed to the plan.
- Let's now say that Bob has received the message, agreed to it, and sent a confirmation back saying "Yes, we attack at dawn!".
  Q: Should Bob now go ahead and attack at dawn?
  A: No, because he does not know that Alice has received his confirmation; again, a lone attack would be very risky.
- Let's then say that Alice receives Bob's agreement to attack at dawn. Is this good enough for Alice to launch an attack? No, because Bob doesn't know that Alice knows that he's agreed to attack...
And this is all getting very silly and sounds like good material for a Monty Python sketch!
As it turns out, it has been proven impossible to eliminate 100% of the uncertainty inherent in systems where communication reliability cannot be guaranteed.
In the Omission Model, it is impossible for either of the two Generals to know with 100% certainty that the other has agreed to attack. This example is just one of many fundamental impossibility results in distributed systems, and it applies in any setting where the communication between participants is liable to failure.
Some possibilities do exist here:
- Instead of trying to decide on the basis of unreliable communication, make a plan in advance and share that plan between the participants. In other words, act on the basis of shared common knowledge3.
- Since unreliable communication excludes the possibility of achieving total certainty, the best we can hope for is reasonable certainty.
To achieve reasonable certainty, we could adopt the following strategy:
- Alice could send Bob a sequence of messages knowing that on average, at least one will get through.
- For every one of Alice's messages that successfully reaches Bob, Bob sends back an acknowledgement. Again, since not all of these acknowledgements will make it back to Alice, multiple acknowledgements are needed in the hope that at least one will get through.
- Alice receives Bob's first acknowledgement and stops sending her stream of messages.
- After a certain period of time, Bob's confidence will start to grow that at least one of his acknowledgements got through because Alice has stopped sending her original message.
The objective here is not to remove all uncertainty, but to lower it to a level considered acceptable enough to act upon.
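Here's a small simulation of that strategy (the loss probability and retry limit are illustrative assumptions, not values from the lecture): Alice retransmits her message over a lossy channel until she sees an acknowledgement from Bob, which shrinks, but never eliminates, the chance that the two generals end up in different states.

```python
import random

def lossy_send(loss_probability: float) -> bool:
    """Return True if a single message makes it through enemy territory."""
    return random.random() >= loss_probability

def attack_at_dawn(loss_probability=0.5, max_rounds=20):
    """Alice repeatedly sends 'Attack at dawn!' until she receives an
    acknowledgement from Bob (or gives up). Return what each general believes."""
    bob_agreed = False
    alice_saw_ack = False
    for _ in range(max_rounds):
        if lossy_send(loss_probability):      # Alice's message reaches Bob
            bob_agreed = True
            if lossy_send(loss_probability):  # Bob's ack reaches Alice
                alice_saw_ack = True
                break                         # Alice stops retransmitting
    return alice_saw_ack, bob_agreed

# Repeating the experiment shows that retransmission makes agreement very
# likely, but some runs can still end with Bob having agreed while Alice
# never saw an acknowledgement; that residual uncertainty cannot be removed.
runs = [attack_at_dawn() for _ in range(10_000)]
print(sum(alice and bob for alice, bob in runs) / len(runs))
```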
Previous | Next |
---|---|
Lecture 9 | Lecture 11 |
Endnotes
1 Why is this type of fault called a "Byzantine" fault?
The name comes from an example in a paper by Leslie Lamport, Robert Shostak and Marshall Pease. Here, they describe a situation in which, in order to avoid being defeated by their enemy, the different Generals in the Byzantine Army must all agree on a coordinated strategy for capturing a city; however, it is suspected that some of these Generals might be traitors and will thus attempt to disrupt the plans of the loyal generals.
To avoid any political issues that might have been created by choosing Generals and Armies from the recent past, they chose to use an example from the Byzantine Empire (the eastern continuation of the Roman Empire after the Western Empire fell in the 5th Century AD). The original paper can be found here
The point here is that the behaviour of a small number of malicious participants will not jeopardise the overall success of the system.
2 Discussion of such authentication techniques is not covered in this course
3 Knowledge is said to be "common" when all participants in a system know with 100% certainty that all other participants share the same knowledge. But not only is this very difficult to establish externally, it is also very fragile. What would happen for instance, if someone needs to change the plan? How could the changed plan then become accepted, common knowledge?