15. Failure Models
02/05/23
- Failure models deal with how (parts of) a distributed system can fail and what can be done to detect and mask such failures
- Types of failure include:
Process Omission Failures
- When a process fails to perform an action it is supposed to do
- Most commonly the process may crash
- Other processes may detect that the process has crashed using timeouts (see the probe sketch after this list)
- A process crash is 'fail-stop' if other processes can detect with certainty that it has failed
- But this is only possible (remotely) in a synchronous system
- In an asynchronous system messages showing it is alive might be lost or still delayed
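A rough illustration of timeout-based crash detection (the address, port and PING/PONG protocol here are made-up assumptions, not part of the notes): if no reply arrives before the timeout the peer is only *suspected* to have crashed, since in an asynchronous system the reply may merely be delayed or lost.

```python
import socket

def probe(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if the peer answered a ping within the timeout.

    A False result only means the peer is suspected to have crashed:
    in an asynchronous system the reply may merely be delayed or lost.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(b"PING\n")
            return sock.recv(16).startswith(b"PONG")
    except OSError:  # covers timeouts and connection errors
        return False

# Example (hypothetical address): suspected = not probe("10.0.0.7", 9000)
```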
Communication omission failures
- Consider a simple message communication:
- The sending process performs a send operation, inserting the message into its outgoing message buffer
- The communication channel transports the message to the receiving process's incoming message buffer
- The receiving process performs a receive operation, taking the message from its incoming message buffer and delivering it
- The message can be lost at each stage
- e.g. due to congestion in the network, lack of space in a buffer or corruption of the message
- Resulting in a send-omission failure, channel omission failure or receive-omission failure, respectively (see the buffer sketch below)
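A minimal simulation of the three omission points, assuming bounded buffers and a lossy channel (the capacities and loss rate are invented for illustration):

```python
import random
from collections import deque
from typing import Optional

OUT_CAPACITY = IN_CAPACITY = 4   # illustrative buffer sizes
LOSS_RATE = 0.1                  # fraction of messages the channel drops

outgoing: deque = deque()        # sender's outgoing message buffer
incoming: deque = deque()        # receiver's incoming message buffer

def send(msg: str) -> bool:
    if len(outgoing) >= OUT_CAPACITY:
        return False             # send-omission failure: no space in outgoing buffer
    outgoing.append(msg)
    return True

def channel_step() -> None:
    if not outgoing:
        return
    msg = outgoing.popleft()
    if random.random() < LOSS_RATE:
        return                   # channel omission failure: lost in transit
    if len(incoming) >= IN_CAPACITY:
        return                   # receive-omission failure: no space in incoming buffer
    incoming.append(msg)

def receive() -> Optional[str]:
    return incoming.popleft() if incoming else None
```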
Arbitrary Failures
- Arbitrary or Byzantine failures describe the worst possible failure semantics, in which any type of error may occur
- A process may omit arbitrary steps and/or insert arbitrary steps
- Communication channel may omit messages, insert arbitrary messages and/or make arbitrary change to messages
- But these are less commonly observed, as the internet protocols include some checks/mitigations for them
Timing Failures
- In a synchronous system time limits may be set on the performance of certain tasks
- A timing failure is a failure to perform the task within the specified time limit (a minimal check is sketched below)
- Creating a synchronous distributed system requires real-time operating systems and networks with high reliability and performance guarantees
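A minimal sketch of checking against a deadline (the names are illustrative; a real real-time system would enforce the limit rather than just measure it afterwards):

```python
import time

def run_with_deadline(task, deadline_s: float):
    """Run task() and report whether it met its deadline."""
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed > deadline_s  # True => timing failure
```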
Coping with Failure
Masking and mitigating failures
- Each component in a distributed system is typically constructed from a collection of other components
- A reliable system can be constructed from unreliable components
- If part of the system can detect a failure then it may be able to mask it, either concealing it entirely or converting it to a more acceptable type of failure
Examples
- Multiple servers holding copies of the same data can continue to provide a service when one of the servers has crashed
- Message checksums detect corrupt packets, which are discarded, converting an arbitrary failure into an omission failure (see the checksum sketch after this list)
- ....
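A sketch of the checksum idea, assuming a CRC32 appended to each packet (the framing format is made up): a packet whose checksum does not match is discarded, so a garbled message (arbitrary failure) becomes a missing one (omission failure).

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(packet: bytes):
    """Return the payload, or None if the checksum does not match."""
    payload, received = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received:
        return None   # corrupt packet dropped: arbitrary failure -> omission failure
    return payload
```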
Fault tolerance
- Failures - "When the behaviour of a system deviates from that which is specified for it" / "deviation from desired or intended state"
- Errors - Internal discrepancy between intended and actual behaviour, caused by a fault
- Faults - A defect in the system
- Fault tolerance - "the ability to operate correctly in the presence of faults"
Failure detection
- Typically a process will need to detect a failure (or error) in another process or communication channel before it can respond to it
- Different failures are detected in different ways
- Checksums for corrupted data/messages
- Sequence numbers for non-existent or duplicate messages (see the sequence-number sketch after this list)
- Timestamps and timeouts for lost messages or crashed processes
- Some failures cannot be detected with certainty in an asynchronous system
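For example, sequence numbers make gaps (lost messages) and duplicates visible; a minimal sketch, assuming numbering starts at 0:

```python
def detect_gaps_and_duplicates(received_seqs):
    """Report missing and duplicated sequence numbers among received messages."""
    seen, duplicates = set(), []
    for seq in received_seqs:
        if seq in seen:
            duplicates.append(seq)
        seen.add(seq)
    missing = sorted(set(range(max(seen) + 1)) - seen) if seen else []
    return missing, duplicates

# e.g. detect_gaps_and_duplicates([0, 1, 1, 3]) -> ([2], [1])
```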
Achieving fault tolerance
- The system might try to detect a fault and then retry the failed work (see the retry sketch below)
- And/or extra work can be done up front, "just in case" some of it fails
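A minimal detect-and-retry sketch (assuming failure surfaces as an exception; the attempt count and delay are arbitrary): retrying masks transient omission failures, while a persistent fault still surfaces after the last attempt.

```python
import time

def call_with_retries(operation, attempts: int = 3, delay_s: float = 0.5):
    """Call operation(); on an exception, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise             # persistent fault: give up and report it
            time.sleep(delay_s)   # transient fault: back off and retry
```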
Replication
- Having entire extra copies of data and/or processes. Not the only form of redundancy
- Replication is used for fault tolerance, high availability, and performance (see the sketch below)
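A sketch of reading from replicated servers (the replica list and fetch function are stand-ins): as long as one replica is up, the service keeps working.

```python
def read_from_replicas(replicas, fetch):
    """Try each replica in turn; return the first successful result."""
    last_error = None
    for replica in replicas:
        try:
            return fetch(replica)        # success: one live copy is enough
        except Exception as err:         # treat any error as this replica failing
            last_error = err
    raise RuntimeError("all replicas failed") from last_error
```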
Networks and Reliability
Performance of communication channels
- Communication can be realised with different paradigms and technologies, but each has the following characteristics (a measurement sketch follows this list):
- Reliability or delivery ratio: fraction of messages successfully delivered
- Latency: The delay between the start of a message's transmission and the beginning of its receipt
- Bandwidth: Total information transmitted in a unit time
- Jitter: Variation in delay
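A crude way to estimate these metrics from observed traffic (all names are illustrative, and "throughput" below is only a rough stand-in for bandwidth):

```python
from statistics import mean, pstdev

def channel_stats(sent_count: int, deliveries):
    """deliveries: list of (delay_seconds, size_bytes) for each delivered message."""
    delays = [delay for delay, _ in deliveries]
    total_bytes = sum(size for _, size in deliveries)
    return {
        "delivery_ratio": len(deliveries) / sent_count if sent_count else 0.0,
        "latency_s": mean(delays) if delays else None,       # average delay
        "throughput_Bps": total_bytes / sum(delays) if sum(delays) > 0 else None,
        "jitter_s": pstdev(delays) if delays else None,       # variation in delay
    }
```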
Issues for distributed systems
Distributed systems by definition involve networks, and the characteristics of those networks impact the distributed system, in particular:
- Performance: The speed at which messages can be transferred depends on the latency and bandwidth of the underlying networks
- Scalability: If the network does not scale then the distributed system cannot scale. Fortunately the internet is scaling quite well, but work is ongoing to improve this
- Reliability: Reliable systems can be built from unreliable parts. But unreliable networks still typically limit performance
- Mobility: The internet has only limited support for devices moving between networks, so additional support can be needed from the distributed system
- Quality of Service: To make QoS guarantees typically requires that the network can also make such guarantees, which many networks cannot.
- Multicast: Applications can simulate multicast (one-to-many communication) to some extent, but when provided by networks it can increase (local) efficiency and reliability and reduce configuration