15. Failure Models
02/05/23
- Failure models deal with how (parts of) a distributed system can fail and what can be done to detect and mask such failures
- Types of failure include:
Process Omission Failures
- When a process fails to perform an action it is supposed to do
- Most commonly the process may crash
- Other processes may detect that the process has crashed using timeouts (see the probe sketch after this list)
- A process crash is 'fail-stop' if other processes can detect with certainty that it has failed
- But this is only possible (remotely) in a synchronous system
- In an asynchronous system messages showing it is alive might be lost or still delayed
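A rough illustration of timeout-based crash detection (the address, port and PING/PONG protocol here are made-up assumptions, not part of the notes): if no reply arrives before the timeout the peer is only *suspected* to have crashed, since in an asynchronous system the reply may merely be delayed or lost.

```python
import socket

def probe(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if the peer answered a ping within the timeout.

    A False result only means the peer is suspected to have crashed:
    in an asynchronous system the reply may merely be delayed or lost.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(b"PING\n")
            return sock.recv(16).startswith(b"PONG")
    except OSError:  # covers timeouts and connection errors
        return False

# Example (hypothetical address): suspected = not probe("10.0.0.7", 9000)
```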
Communication omission failures
- Consider a simple message communication:
- The sending process performs a send operation, inserting the message into its outgoing message buffer
- The communication channel transports the message to the receiving process's incoming message buffer
- The receiving process performs a receive operation, taking the message from its incoming message buffer and delivering it
- The message can be lost at each stage
- e.g. due to congestion in the network, lack of space in a buffer or corruption of the message
- Resulting in a send-omission failure, channel omission failure or receive-omission failure, respectively (see the buffer sketch below)
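A minimal simulation of the three omission points, assuming bounded buffers and a lossy channel (the capacities and loss rate are invented for illustration):

```python
import random
from collections import deque
from typing import Optional

OUT_CAPACITY = IN_CAPACITY = 4   # illustrative buffer sizes
LOSS_RATE = 0.1                  # fraction of messages the channel drops

outgoing: deque = deque()        # sender's outgoing message buffer
incoming: deque = deque()        # receiver's incoming message buffer

def send(msg: str) -> bool:
    if len(outgoing) >= OUT_CAPACITY:
        return False             # send-omission failure: no space in outgoing buffer
    outgoing.append(msg)
    return True

def channel_step() -> None:
    if not outgoing:
        return
    msg = outgoing.popleft()
    if random.random() < LOSS_RATE:
        return                   # channel omission failure: lost in transit
    if len(incoming) >= IN_CAPACITY:
        return                   # receive-omission failure: no space in incoming buffer
    incoming.append(msg)

def receive() -> Optional[str]:
    return incoming.popleft() if incoming else None
```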
Arbitrary Failures
- Arbitrary or Byzantine failures describe the worst possible failure semantics, in which any type of error may occur
- A process may omit arbitrary steps and/or insert arbitrary steps
- Communication channel may omit messages, insert arbitrary messages and/or make arbitrary change to messages
- But these are less commonly observed, as the internet protocols include some checks/mitigations for them
Timing Failures
- In a synchronous system time limits may be set on the performance of certain tasks
- A timing failure is a failure to perform the task within the specified time limit (a minimal check is sketched below)
- Creating a synchronous distributed system requires real-time operating systems and networks with high reliability and performance guarantees
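A minimal sketch of checking against a deadline (the names are illustrative; a real real-time system would enforce the limit rather than just measure it afterwards):

```python
import time

def run_with_deadline(task, deadline_s: float):
    """Run task() and report whether it met its deadline."""
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed > deadline_s  # True => timing failure
```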
Coping with Failure
Masking and mitigating failures
- Each component in a distributed system is typically constructed from a collection of other components
- A reliable system can be constructed from unreliable components
- If part of the system can detect a failure then it may be able to mask it, either concealing it entirely or converting it to a more acceptable type of failure
Examples
- Multiple servers holding copies of the same data can continue to provide a service when one of the servers has crashed
- Message checksums detect corrupt packets, which are discarded, converting an arbitrary failure into an omission failure (see the checksum sketch after this list)
- ....
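A sketch of the checksum idea, assuming a CRC32 appended to each packet (the framing format is made up): a packet whose checksum does not match is discarded, so a garbled message (arbitrary failure) becomes a missing one (omission failure).

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(packet: bytes):
    """Return the payload, or None if the checksum does not match."""
    payload, received = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received:
        return None   # corrupt packet dropped: arbitrary failure -> omission failure
    return payload
```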
Fault tolerance
- Failures - "When the behaviour of a system deviates from that which is specified for it" / "deviation from desired or intended state"
- Errors - Internal discrepancy between intended and actual behaviour, caused by a fault
- Faults - A defect in the system
- Fault tolerance - "the ability to operate correctly in the presence of faults"
Failure detection
- Typically a process will need to detect a failure (or error) in another process or communication channel before it can respond to it
- Different failures are detected in different ways
- Checksums for corrupted data/messages
- Sequence numbers for non-existent or duplicate messages (see the sequence-number sketch after this list)
- Timestamps and timeouts for lost messages or crashed processes
- Some failures cannot be detected with certainty in an asynchronous system
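For example, sequence numbers make gaps (lost messages) and duplicates visible; a minimal sketch, assuming numbering starts at 0:

```python
def detect_gaps_and_duplicates(received_seqs):
    """Report missing and duplicated sequence numbers among received messages."""
    seen, duplicates = set(), []
    for seq in received_seqs:
        if seq in seen:
            duplicates.append(seq)
        seen.add(seq)
    missing = sorted(set(range(max(seen) + 1)) - seen) if seen else []
    return missing, duplicates

# e.g. detect_gaps_and_duplicates([0, 1, 1, 3]) -> ([2], [1])
```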
Achieving fault tolerance
- The system might try to detect a fault and then retry the failed work (see the retry sketch below)
- And/or extra work can be done up front, "just in case" some of it fails
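A minimal detect-and-retry sketch (assuming failure surfaces as an exception; the attempt count and delay are arbitrary): retrying masks transient omission failures, while a persistent fault still surfaces after the last attempt.

```python
import time

def call_with_retries(operation, attempts: int = 3, delay_s: float = 0.5):
    """Call operation(); on an exception, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise             # persistent fault: give up and report it
            time.sleep(delay_s)   # transient fault: back off and retry
```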
Replication
- Having entire extra copies of data and/or processes. Not the only form of redundancy
- Replication is used for fault tolerance, high availability, and performance (see the sketch below)
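A sketch of reading from replicated servers (the replica list and fetch function are stand-ins): as long as one replica is up, the service keeps working.

```python
def read_from_replicas(replicas, fetch):
    """Try each replica in turn; return the first successful result."""
    last_error = None
    for replica in replicas:
        try:
            return fetch(replica)        # success: one live copy is enough
        except Exception as err:         # treat any error as this replica failing
            last_error = err
    raise RuntimeError("all replicas failed") from last_error
```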
Networks and Reliability
Performance of communication channels
- Communication can be realised with different paradigms and technologies, but each has the following characteristics (a measurement sketch follows this list):
- Reliability or delivery ratio: fraction of messages successfully delivered
- Latency: The delay between the start of a message's transmission and the beginning of its receipt
- Bandwidth: Total information transmitted in a unit time
- Jitter: Variation in delay
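A crude way to estimate these metrics from observed traffic (all names are illustrative, and "throughput" below is only a rough stand-in for bandwidth):

```python
from statistics import mean, pstdev

def channel_stats(sent_count: int, deliveries):
    """deliveries: list of (delay_seconds, size_bytes) for each delivered message."""
    delays = [delay for delay, _ in deliveries]
    total_bytes = sum(size for _, size in deliveries)
    return {
        "delivery_ratio": len(deliveries) / sent_count if sent_count else 0.0,
        "latency_s": mean(delays) if delays else None,       # average delay
        "throughput_Bps": total_bytes / sum(delays) if sum(delays) > 0 else None,
        "jitter_s": pstdev(delays) if delays else None,       # variation in delay
    }
```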
Issues for distributed systems
Distributed systems by definition involve networks, and the characteristics of those networks impact the distributed system, in particular:
- Performance: The speed at which messages can be transferred depends on the latency and bandwidth of the underlying networks
- Scalability: If the network does not scale then the distributed system cannot scale. Fortunately the internet is scaling quite well, but work is ongoing to improve this
- Reliability: Reliable systems can be built from unreliable parts. But unreliable networks still typically limit performance
- Mobility: The internet has only limited support for devices moving between networks, so additional support can be needed from the distributed system
- Quality of Service: To make QoS guarantees typically requires that the network can also make such guarantees, which many networks cannot.
- Multicast: Applications can simulate multicast (one-to-many communication) to some extent, but when provided by networks it can increase (local) efficiency and reliability and reduce configuration