A Few Notes on Distributed Systems Models and Failures

While the terms “distributed systems” and “distributed computing” are often used interchangeably, for my purposes, I’ve found it useful to draw a distinction. Distributed systems research is often focused on how to get multiple components of a system to act together in a reliable way, whereas distributed computing research is focused on how to use multi-component systems to solve a problem by breaking it down into different parts. While distributed systems is more of an engineering discipline focused on systems architecture, distributed computing is more of a mathematical discipline focused on the design of algorithms.¹

Consensus and agreement are important problems in both distributed computing and distributed systems. The defining property of distributed systems research is that it cannot afford to idealize the world and assume that the system will function properly. Networks are inherently unreliable, and computers may crash, malfunction, or be compromised by attackers. The goal, therefore, is to design protocols that will allow the distributed system to function correctly as a whole even if some parts of the system are malfunctioning. This means that communication complexity means very different things in distributed systems than it does in distributed computing. The communication complexity of consensus or agreement protocol, is not the cost of computing the answer to a specific question, but rather the communication cost of maintaining reliability within the distributed system.

Goals of Distributed Systems

Developing distributed systems is a task with many pitfalls. One cannot assume that the network will be secure, reliable, or homogeneous; one cannot assume that the topology of the network will not change; one cannot assume that latency will be zero, that bandwidth is infinite, or that transport cost is negligible. In fact, most of these assumptions are false. The internet, for example, is insecure and unreliable by design.

Three important concepts in distributed systems are faults, failures, and partial failures. A fault in a system occurs when one part of the system is deviating from its specification, whereas a failure occurs when a system as a whole stops providing an intended service. A partial failure occurs in a distributed system when there are some parts of the system that are broken in an unpredictable way, but the rest of the system is working fine. The goal is to design reliable and resilient systems that can anticipate and cope with certain classes of faults. Such systems are known as fault-tolerant systems.

Most distributed systems are shared-nothing systems; that is, each machine has its own memory and disk, and communicates with other machines via a network. Since most networks are asynchronous packet networks, where a message is not guaranteed to arrive at its intended recipient, network faults are extremely common. This should serve as at least one example of why partial failures are common in distributed systems.

Types of Failures

Fail-stop failure. This is when a process fails by halting execution at some point in time. Other processes are notified that the halted process has failed.

Crash failure. This is when a process fails by halting execution at some point in time. Unlike a fail-stop failure, other processes are not notified that the halted process has failed.

Receive omission failure. This is when a process fails either by crashing or by intermittently receiving only some of the messages sent to it.

Send omission failure. This is when a process fails either by crashing or by intermittently sending only some of the messages it is supposed to send.

General omission failure. This is when a process fails either by exhibiting receive omission failure, or by exhibiting send omission failure, or by otherwise failing to respond to incoming requests.

Timing failure. This is when a process fails to respond within the specified timing interval.

Value failure. This is when a process provides incorrect responses to requests.

State transition failure. This is when a process deviates from the correct flow of control when acting in response to an incoming request.

Response failure. This is when a process exhibits value failure, or state transition failure, or both.

Byzantine (or arbitrary) failure, with authentication. This is when a process can exhibit any arbitrary behaviour at arbitrary times, including malicious behaviours such as lying. In this case, when a faulty process claims to have received a message from a correct process, that claim can be verified using cryptographic signatures.

Byzantine (or arbitrary) failure. This is when a process can exhibit any arbitrary behaviour at arbitrary times, including malicious behaviours such as lying. However, there is no authentication.

System Models

Many algorithms have been designed to solve distributed systems problems, but in order to be useful, they must tolerate the various faults of distributed systems. The types of faults that we expect an algorithm to be able to handle are codified using system models. Three models are often used with regards to timing:

Synchronous model

This model assumes that network delay, process pauses, and clock error are all bounded. Formally, there is a fixed upper bound $\Delta$ on the time for messages to be delivered (synchronous communication) and there is a fixed upper bound $\Phi$ on the rate at which one processor’s clock is allowed to run faster than another processor’s clock (processors are synchronous). This is not a realistic model for most practical systems because unbounded delays and pauses occur in real life.

Asynchronous model

The algorithm is not allowed to make any timing assumptions (i.e. everything is unbounded). Designing algorithms under the asynchronous model is difficult; in fact, consensus is known to be impossible to achieve under this model.

Partially synchronous model

This model assumes that the system behaves like a synchronous system the majority of the time. However, it may occasionally exceed the bounds for network delay, process pauses, or clock drift. This is a much more realistic model. Formally, this can happen in two different ways:

The upper bounds $\Delta$ and $\Phi$ exist, but are unknown.
The system is asynchronous up until some global stabilization time (GST), unknown to the processors, after which $\Delta$ and $\Phi$ must subsequently hold.

Additionally, three system models are often used with regard to single node behaviour:

Crash-stop faults

In this model, an algorithm is allowed to assume that nodes will only fail by crashing. This means that the nodes may suddenly stop responding at any point in time and subsequently never come back online.

Crash-recovery faults

In this model, a node may crash at any time, and perhaps come back online at some unknown period of time.

Byzantine (arbitrary) faults

In this model, node are allowed to do anything, including tricking and deceiving other nodes.

The Byzantine Generals Problem

The Byzantine Generals Problem was introduced by Leslie Lamport in 1982 as a theoretical question that was meant to inform the design of reliable systems. The problem is named for the Byzantine empire, and imagines a situation where a group of generals of the Byzantine army are parked outside an enemy city, and they need to coordinate their actions in order to decide whether to attack or retreat. These generals can only communicate via oral messengers. However, these messengers could possibly lie or be corrupted by traitors. Additionally, some of the generals may be traitorous themselves, and may provide false or conflicting information to disrupt the battle plan. The goal was to create an algorithm that would allow all loyal generals to agree on a single plan.

In this version of the problem, one of the generals is a commander, who broadcasts an order to all of the other generals. The generals must agree on which order to follow. Without the presence of cryptography, this problem is unsolvable unless more than two-thirds of the commanders are loyal. When each message is signed using a digital signature, this problem is solvable as long as at least one of the generals is loyal. In the consensus version of the problem, each general already holds an instruction, and the loyal generals are trying to determine whether they all hold the same order. This version of the problem is known as Byzantine agreement.

The concept of Byzantine faults is named after this problem, and Byzantine fault-tolerant protocols are protocols that are resilient to nodes exhibiting arbitrary behaviours.

Useful References

Some useful references I read while writing these notes:

Martin Kleppmann. Designing Data-intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, 2017.
Ajay D. Kshemkalyani and Mukesh Singhal. Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, 2011.
Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. In: ACM Trans. Program. Lang. Syst. 4.3 (July 1982), pp. 382–401. url: https://doi.org/10.1145/357172.357176
Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms (2nd Edition). USA: Prentice-Hall, Inc., 2006.
Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. “Consensus in the presence of partial synchrony”. In: J. ACM 35.2 (Apr. 1988), pp. 288–323. url: https://doi.org/10.1145/42282.42283
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. “Impossibility of distributed consensus with one faulty process”. In: J. ACM 32.2 (Apr. 1985), pp. 374–382. url: https://doi.org/10.1145/3149.214121

Of course, like all distinctions of this type, this definition is somewhat arbitrary and false. You will find both types of work being done by people who claim either disciplinary label, because real problems don’t care about disciplinary labels. ↩︎