
15. Failure Models

02/05/23


  • Failure models deal with how (parts of) a distributed system can fail and what can be done to detect and mask such failures
  • Types of failure include:

Process Omission Failures

  • When a process fails to perform an action it is supposed to do
  • Most commonly the process may crash
  • Other processes may detect that the process has crashed using timeouts (see the sketch after this list)
  • A process crash is 'fail-stop' if other processes can detect with certainty it has failed
    • But this is only possible (remotely) in a synchronous system
    • In an asynchronous system, messages showing it is alive might merely be lost or still delayed
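
A minimal sketch of timeout-based crash detection, assuming a hypothetical last_heartbeat map that the messaging layer updates whenever a heartbeat arrives; in an asynchronous system a timeout can only make a crash suspected, never certain.

```python
import time

# Hypothetical failure-detector state: the last time a heartbeat was
# seen from each monitored process (assumed to be updated by the
# messaging layer whenever a heartbeat message arrives).
last_heartbeat = {"p1": time.monotonic(), "p2": time.monotonic()}

TIMEOUT = 5.0  # seconds without a heartbeat before we suspect a crash

def suspected_crashed(process_id: str) -> bool:
    """Suspect a crash if no heartbeat arrived within TIMEOUT.

    In an asynchronous system this is only a suspicion, not a
    fail-stop guarantee: the heartbeat may merely be lost or delayed.
    """
    return time.monotonic() - last_heartbeat[process_id] > TIMEOUT
```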

Communication Omission Failures

  • Consider a simple message communication:
    • Process p performs a send operation on a message m by inserting it into an outgoing message buffer
    • The communication channel transports message m to process q's incoming message buffer
    • Process q performs a receive operation to take message m from its incoming message buffer and deliver it
  • A message can be lost at each stage (see the sketch after this list)
    • e.g. due to congestion in the network, lack of space in a buffer or corruption of the message
    • Resulting in a send-omission failure, channel-omission failure or receive-omission failure, respectively
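
A toy sketch of the three loss points, using a hypothetical drop probability to stand in for congestion, full buffers or corruption; loss at the sender, in the channel or at the receiver corresponds to the three omission failures named above.

```python
import random

def lost(p_loss: float = 0.1) -> bool:
    """Toy stand-in for congestion, a full buffer or corruption."""
    return random.random() < p_loss

def send(message, outgoing: list):
    # p inserts the message into its outgoing message buffer.
    if lost():                    # send-omission failure
        return
    outgoing.append(message)

def channel(outgoing: list, incoming: list):
    # The channel carries the message to q's incoming message buffer.
    if outgoing:
        message = outgoing.pop(0)
        if lost():                # channel-omission failure
            return
        incoming.append(message)

def receive(incoming: list):
    # q takes the message from its incoming buffer and delivers it.
    if not incoming or lost():    # receive-omission failure
        return None
    return incoming.pop(0)
```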

Arbitrary Failures

  • Arbitrary or Byzantine failures describe the worst possible failure semantics, in which any type of error may occur
  • A process may omit arbitrary steps and/or insert arbitrary steps
  • A communication channel may omit messages, insert arbitrary messages and/or make arbitrary changes to messages
    • But these are less commonly observed, as Internet protocols include some checks/mitigations for this

Timing Failures

  • In a synchronous system time limits may be set on the performance of certain tasks
  • A timing failure is a failure to perform the task within the specified time limit
  • Creating a synchronous distributed system requires real-time operating systems and networks with high reliability and performance guarantees

Coping with Failure

Masking and mitigating failures

  • Each component in a distributed system is typically constructed from a collection of other components
  • A reliable system can be constructed from unreliable components
  • If part of the system can detect the failure then it may be able to mask the failure, either concealing it entirely or converting it into a more acceptable type of failure

Examples

  • Multiple servers holding copies of the same data can continue to provide a service when one of the servers has crashed
  • Message checksums detect corrupt packets, which are discarded, converting an arbitrary failure to an omission failure (see the sketch after this list)
  • ....

Fault tolerance

  • Failures - "When the behaviour of a system deviates from that which is specified for it" / "deviation from desired or intended state"
  • Errors - Internal discrepancy between intended and actual behaviour, caused by a fault
  • Faults - A defect in the system
  • Fault tolerance - "the ability to operate correctly in the presence of faults"

Failure detection

  • Typically a process will need to detect a failure (or error) in another process or communication channel before it can respond to it
  • Different failures are detected in different ways
    • Checksums for corrupted data/messages
    • Sequence numbers for non-existent or duplicate messages (see the sketch after this list)
    • Timestamps and timeouts for lost messages or crashed processes
  • Some failures cannot be detected with certainty in an asynchronous system
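
A sketch of sequence-number checking on the receiving side, assuming each message carries a hypothetical seq field; a gap reveals lost messages, while a repeated number reveals a duplicate.

```python
def check_sequence(expected: int, seq: int):
    """Classify a message by its sequence number.

    Returns (status, next_expected): a gap means messages were lost,
    a number we have already passed means a duplicate.
    """
    if seq == expected:
        return "ok", expected + 1
    if seq > expected:
        return f"{seq - expected} message(s) lost", seq + 1
    return "duplicate", expected

# Example: 0, 1, 3 (message 2 lost), then 1 again (a duplicate)
expected = 0
for seq in [0, 1, 3, 1]:
    status, expected = check_sequence(expected, seq)
    print(seq, status)
```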

Achieving fault tolerance

  • The system might try to detect a fault and then re-try the failed work (see the sketch after this list)
  • And/or extra work can be done from the outset "just in case" some of it will fail
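
A minimal retry sketch, assuming a hypothetical do_work callable that raises an exception on failure; the caller detects the fault and re-tries the work a bounded number of times.

```python
import time

def retry(do_work, attempts: int = 3, delay: float = 0.5):
    """Call do_work(), re-trying up to `attempts` times on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return do_work()      # failure is detected via an exception
        except Exception:
            if attempt == attempts:
                raise             # give up: the fault persists
            time.sleep(delay)     # back off, then re-try the failed work
```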

Replication

  • Having entire extra copies of data and/or processes; this is not the only form of redundancy
  • Replication is used for fault tolerance, high availability, and performance (see the sketch below)
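
A minimal sketch of replication for fault tolerance: writes go to every live replica and a read can be served by any surviving one, so the crash of a single server is masked. Keeping the replicas consistent is a separate problem not shown here.

```python
class Replica:
    """One full copy of the data; `alive` models whether it has crashed."""
    def __init__(self):
        self.data = {}
        self.alive = True

def write(replicas, key, value):
    # Write to every live replica so extra copies survive a crash.
    for r in replicas:
        if r.alive:
            r.data[key] = value

def read(replicas, key):
    # Any surviving replica can answer, masking a crashed server.
    for r in replicas:
        if r.alive and key in r.data:
            return r.data[key]
    return None

replicas = [Replica(), Replica(), Replica()]
write(replicas, "x", 42)
replicas[0].alive = False         # one server crashes
print(read(replicas, "x"))        # still 42: the failure is masked
```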

Networks and Reliability

Performance of communication channels

  • Communication can be realised with different paradigms and technologies, but each has:
  • Reliability or delivery ratio: fraction of messages successfully delivered
  • Latency: The delay between the start of a message transmission and beginning of its receipt
  • Bandwidth: Total amount of information that can be transmitted per unit time (see the worked example after this list)
  • Jitter: Variation in delay
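
A worked example with illustrative (assumed) numbers: the time to deliver a message is roughly the latency plus the message size divided by the bandwidth.

```python
def transfer_time(size_bits: float, latency_s: float, bandwidth_bps: float) -> float:
    """Approximate delivery time: latency + size / bandwidth."""
    return latency_s + size_bits / bandwidth_bps

# e.g. a 1 MB message over a 100 Mbit/s link with 20 ms latency
size = 8 * 1_000_000                        # bits
print(transfer_time(size, 0.020, 100e6))    # ~0.1 seconds
```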

Issues for distributed systems

Distributed systems by definition involve networks, and the characteristics of those networks impact the distributed system. In particular:

  • Performance: The speed at which messages can be transferred depends on the latency and bandwidth of the underlying networks
  • Scalability: If the network does not scale then the distributed system cannot scale. Fortunately the internet is scaling quite well, but work is ongoing to improve this
  • Reliability: Reliable systems can be built from unreliable parts. But unreliable networks still typically limit performance
  • Mobility: The internet has only limited support for devices moving between networks, so additional support can be needed from the distributed system
  • Quality of Service: To make QoS guarantees typically requires that the network can also make such guarantees, which many networks cannot.
  • Multicast: Applications can simulate multicast (one-to-many communication) to some extent, but when provided by networks it can increase (local) efficiency and reliability and reduce configuration (see the sketch below)
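
A sketch of how an application can simulate multicast when the network does not provide it, assuming a hypothetical unicast send primitive: the sender simply sends the same message to each group member in turn, which costs one transmission per member rather than one shared transmission.

```python
def send(member: str, message: str) -> None:
    """Hypothetical unicast primitive (stand-in for a real socket send)."""
    print(f"unicast to {member}: {message}")

def simulated_multicast(group: list[str], message: str) -> None:
    # One unicast per member: n messages from the sender, whereas
    # network-provided multicast could use a single transmission on
    # shared links, improving (local) efficiency.
    for member in group:
        send(member, message)

simulated_multicast(["p1", "p2", "p3"], "hello")
```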