Byzantine fault tolerance
is a sub-field of fault tolerance research inspired by the Byzantine Generals' Problem
, which is a generalized version of the Two Generals' Problem
In computing, the Two Generals' Problem is a thought experiment meant to illustrate the pitfalls and design challenges of attempting to coordinate an action by communicating over an unreliable link...
The object of Byzantine fault tolerance is to be able to defend against Byzantine failures
, in which components of a system fail in arbitrary ways (i.e., not just by stopping or crashing but by processing requests incorrectly, corrupting their local state, and/or producing incorrect or inconsistent outputs.). Correctly functioning components of a Byzantine fault tolerant system will be able to correctly provide the system's service assuming there are not too many Byzantine faulty components.
A Byzantine fault
is an arbitrary fault
In document ISO/CD 10303-226, a fault is defined as an abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure....
that occurs during the execution of an algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
by a distributed system. It encompasses both omission failures
(e.g., crash failures, failing to receive a request, or failing to send a response) and commission failures
(e.g., processing a request incorrectly, corrupting local state, and/or sending an incorrect or inconsistent response to a request.) When a Byzantine failure has occurred, the system may respond in any unpredictable way, unless it is designed to have Byzantine fault tolerance.
For example, if the output of one function is the input of another, then small round-off error
A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate this error when using approximation equations and/or algorithms, especially when using finitely many...
s in the first function can produce much larger errors in the second. If the second function were fed into a third, the problem could grow even larger, until the values produced are worthless. Another example is in compiling
A compiler is a computer program that transforms source code written in a programming language into another computer language...
In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...
. One minor syntactical error early on in the code can produce large numbers of perceived errors later, as the parser of the compiler gets out-of-phase with the lexical and syntactic information in the source program. Such failures have brought down major Internet services. For example, in 2008 Amazon S3
Amazon S3 is an online storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces...
was brought down for several hours when a single-bit hardware error propagated through the system, and in 2009 the Ma.gnolia bookmark sharing website was shuttered after a file system error gradually corrupted the system's database beyond recovery.
In a Byzantine fault tolerant (BFT) algorithm, steps are taken by processes, the logical abstractions that represent the execution path of the algorithms. A faulty process is one that at some point exhibits any of the above failures. A process that is not faulty is correct
In theoretical computer science, correctness of an algorithm is asserted when it is said that the algorithm is correct with respect to a specification...
The Byzantine failure assumption models real-world environments in which computers and networks may behave in unexpected ways due to hardware failures, network congestion and disconnection, as well as malicious attacks. Byzantine failure-tolerant algorithms must cope with such failures and still satisfy the specifications of the problems they are designed to solve. Such algorithms are commonly characterized by their resilience t
, the number of faulty processes with which an algorithm can cope.
Many classic agreement problems, such as the Byzantine Generals' Problem, have no solution unless n
, where n
is the number of processes in the system. In other words, the algorithm can ensure correct operation only if fewer than one third of the processes are faulty.
refers to the Byzantine Generals' Problem, an agreement problem (first proposed by Marshall Pease, Robert Shostak, and Leslie Lamport
Leslie Lamport is an American computer scientist. A graduate of the Bronx High School of Science, he received a B.S. in mathematics from the Massachusetts Institute of Technology in 1960, and M.A. and Ph.D. degrees in mathematics from Brandeis University, respectively in 1963 and 1972...
in 1980) in which generals of the Byzantine Empire's
The Byzantine Empire was the Eastern Roman Empire during the periods of Late Antiquity and the Middle Ages, centred on the capital of Constantinople. Known simply as the Roman Empire or Romania to its inhabitants and neighbours, the Empire was the direct continuation of the Ancient Roman State...
army must decide unanimously whether to attack some enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals. These traitors can act arbitrarily in order to achieve the following aims: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or confusing some generals to the point that they are unable to make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.
Byzantine fault tolerance can be achieved, if the loyal (non-faulty) generals have a unanimous agreement on their strategy. Note that if the source general is correct, all loyal generals must agree upon that value. Otherwise, the choice of strategy agreed upon is irrelevant.
Several solutions were originally described by Lamport, Shostak, and Pease in 1982. They began by noting that the Generals' Problem can be reduced to solving a "Commander and Lieutenants" problem where Loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is Loyal. Roughly speaking, the Generals vote by treating each others' orders as votes.
- One solution considers scenarios in which messages may be forged, but which will be Byzantine-fault-tolerant as long as the number of traitorous generals does not equal or exceed one third. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the 1 Commander + 2 Lieutenants problem cannot be solved, if the Commander is traitorous. The reason is, if we have three commanders, A, B, and C, and A is the traitor: when A tells B to attack and C to retreat, and B and C send messages to each other, forwarding A's message, neither B nor C can figure out who is the traitor, since it isn't necessarily A – the other commander could have forged the message purportedly from A. It can be shown that if n is the number of generals in total, and t is the number of traitors in that n, then there are solutions to the problem only when n is greater than or equal to 3t + 1.
- A second solution requires unforgeable signature
A digital signature or digital signature scheme is a mathematical scheme for demonstrating the authenticity of a digital message or document. A valid digital signature gives a recipient reason to believe that the message was created by a known sender, and that it was not altered in transit...
s (in modern computer systems, this may be achieved in practice using public-key cryptography
Public-key cryptography refers to a cryptographic system requiring two separate keys, one to lock or encrypt the plaintext, and one to unlock or decrypt the cyphertext. Neither key will do both functions. One of these keys is published or public and the other is kept private...
), but maintains Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals.
- Also presented is a variation on the first two solutions allowing Byzantine-fault-tolerant behavior in some situations where not all generals can communicate directly with each other.
Practical Byzantine fault tolerance
Byzantine fault tolerant replication protocols were long considered too expensive to be practical. Then in 1999, Miguel Castro and Barbara Liskov
Barbara Liskov is a computer scientist. She is currently the Ford Professor of Engineering in the MIT School of Engineering's Electrical Engineering and Computer Science department and an Institute Professor at the Massachusetts Institute of Technology.-Life and career:She earned her BA in...
introduced the "Practical Byzantine Fault Tolerance" (PBFT) algorithm, which provides high-performance Byzantine state machine replication, processing thousands of requests per second with sub-millisecond increases in latency.
PBFT triggered a renaissance in BFT replication research, with protocols like Q/U, HQ,, Zyzzyva, and ABsTRACTs working to lower costs and improve performance and protocols like Aardvark working to improve robustness.
UpRight is an open source library for constructing services that tolerate both crashes ("up") and Byzantine behaviors ("right") that incorporates many of these protocols' innovations.
One example of BFT in use is Bitcoin
Bitcoin is a decentralized, peer-to-peer network over which users make transactions that are tracked and verified through this network. The word Bitcoin also refers to the digital currency implemented as the currency medium for user transactions over this network...
, a peer-to-peer digital currency system. The Bitcoin network works in parallel to generate a chain of Hashcash
Hashcash is a proof-of-work system designed to limit email spam and denial-of-service attacks. It was proposed in March 1997 by Adam Back.-How it works:...
style proof-of-work. The proof-of-work chain is the key to solving the Byzantine Generals' Problem of synchronising the global view and generating computational proof of the majority consensus.
- Atomic commit
An atomic commit is an operation in which a set of distinct changes is applied as a single operation. If the changes are applied then the atomic commit is said to have succeeded. If there is a failure before the atomic commit can be completed then all of the changes completed in the atomic commit...
- Brooks–Iyengar algorithm
The Brooks–Iyengar algorithm or Brooks–Iyengar hybrid algorithm is a distributed algorithm, that improves both the precision and accuracy of the measurements taken by a distributed sensor network, even in the presence of faulty sensors. The sensor network does this by exchanging the measured...
- Consensus (computer science)
Consensus is a problem in distributed computing that encapsulates the task of group agreement in the presence of faults.In particular, any process in the group may fail at any time. Consensus is fundamental to core techniques in fault tolerance, such as state machine replication.- Problem...
- Quantum Byzantine agreement
Byzantine fault tolerant protocols are algorithms that are robust to arbitrary types of failures in distributed algorithms. With the advent and popularity of the Internet, there is a need to develop algorithms that do not require any centralized control that have some guarantee of always working...