Fault-tolerant design
In engineering, fault-tolerant design is a design that enables a system to continue operation, possibly at a reduced level (also known as graceful degradation), rather than failing completely, when some part of the system fails. The term is most commonly used to describe computer-based systems designed to remain more or less fully operational, with perhaps a reduction in throughput or an increase in response time, in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured. Similarly, a fault-tolerant structure is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.

Components

If each component, in turn, can continue to function when one of its subcomponents fails, this will allow the total system to continue to operate, as well. Using a passenger vehicle as an example, a car can have "run-flat" tires, which each contain a solid rubber core, allowing them to be used even if a tire is punctured. The punctured "run-flat" tire may be used for a limited time at a reduced speed.

Redundancy

This means having backup components which automatically "kick in" should one component fail. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with the exception of the front tires, which are used to steer).
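
In a computer-based system, the same idea of a backup that "kicks in" automatically might be sketched as follows. This is only an illustration under simple assumptions: the names `first_working`, `ComponentFailed`, `primary` and `backup` are hypothetical, and a component signals a fault by raising an exception.

```python
class ComponentFailed(Exception):
    """Raised by a component that cannot do its job (hypothetical name)."""


def first_working(components, request):
    """Try each redundant component in order and return the first result.

    `components` is a list of callables: the first is the primary and the
    rest are backups. Faults are collected so they can still be reported.
    """
    failures = []
    for component in components:
        try:
            return component(request), failures
        except ComponentFailed as exc:
            failures.append(str(exc))  # record the fault, then fail over
    raise ComponentFailed(f"all components failed: {failures}")


def primary(request):
    # Simulated faulty primary component.
    raise ComponentFailed("primary is down")


def backup(request):
    # Simulated healthy backup component.
    return f"handled {request!r} on backup"


result, failures = first_working([primary, backup], "job-1")
print(result)
print(failures)
```

Returning the list of recorded faults together with the result is a deliberate choice in this sketch: the request succeeds, but the failure of the primary stays visible to whoever operates the system.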

When to use

Providing fault-tolerant design for every component is normally not an option. In such cases the following criteria may be used to determine which components should be fault-tolerant:
  • How critical is the component? In a car, the radio is not critical, so this component has less need for fault-tolerance.

  • How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault-tolerance is needed.

  • How expensive is it to make the component fault-tolerant? Requiring a redundant car engine, for example, would likely be too expensive both economically and in terms of weight and space, to be considered.


An example of a component that passes all the tests is a car's occupant restraint system. Although we do not normally think of it as such, the primary occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so we pass the first test. Accidents causing occupant ejection were quite common before seat belts, so we pass the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass that test by a smaller margin.
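
The three tests can be written down as a simple checklist. The sketch below is a deliberate simplification: it reduces each criterion to a yes/no flag, whereas in practice each is a matter of degree. The `Component` class, its field names, and any flag values not stated in the text (for example, whether a radio is likely to fail) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    critical: bool        # test 1: is the component critical?
    likely_to_fail: bool  # test 2: is it reasonably likely to fail?
    affordable: bool      # test 3: is fault-tolerance affordable (cost, weight, space)?


def should_be_fault_tolerant(c: Component) -> bool:
    """A component qualifies only if it passes all three tests."""
    return c.critical and c.likely_to_fail and c.affordable


candidates = [
    Component("radio", critical=False, likely_to_fail=True, affordable=True),
    Component("drive shaft", critical=True, likely_to_fail=False, affordable=True),
    Component("engine", critical=True, likely_to_fail=True, affordable=False),
    Component("occupant restraint", critical=True, likely_to_fail=True, affordable=True),
]

selected = [c.name for c in candidates if should_be_fault_tolerant(c)]
print(selected)
```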

Examples

Hardware fault-tolerance sometimes requires that broken parts be taken out and replaced with new ones while the system is still operational (in computing known as hot swapping). Such a system implemented with a single backup is known as single point tolerant, and represents the vast majority of fault-tolerant systems. In such systems the mean time between failures should be long enough for the operators to have time to fix the broken devices (mean time to repair) before the backup also fails. It helps if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.
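
The relationship between the two quantities can be made concrete with a rough calculation. The sketch below assumes that the backup's time to failure is exponentially distributed with a mean equal to its MTBF; this is a common modelling assumption, not something the text specifies, and the function name and the numbers are hypothetical.

```python
import math


def backup_failure_probability(mtbf_hours: float, mttr_hours: float) -> float:
    """Probability that the backup fails during one repair window.

    Assumes an exponentially distributed time to failure with mean
    `mtbf_hours` (an assumption of this sketch).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return 1.0 - math.exp(-mttr_hours / mtbf_hours)


# Hypothetical numbers: a backup with an MTBF of 10,000 hours.
slow_repair = backup_failure_probability(10_000, 48)  # two-day repair
fast_repair = backup_failure_probability(10_000, 4)   # four-hour repair
print(f"48 h repair window: {slow_repair:.4%}")
print(f" 4 h repair window: {fast_repair:.4%}")
```

Under this model the risk depends only on the ratio of repair time to MTBF, which is why shortening repairs helps even when the time between failures cannot be lengthened.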

Fault-tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single point tolerance to create their NonStop systems with uptimes measured in years.

Fail-safe architectures may also encompass the computer software, for example through process replication.

Disadvantages

Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:
  • Interference with fault detection in the same component. To continue the above passenger vehicle example, it may not be obvious to the driver when a tire has been punctured, with either of the fault-tolerant systems. This is usually handled with a separate "automated fault detection system". In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a "manual fault detection system", such as manually inspecting all tires at each stop.

  • Interference with fault detection in another component. Another variation of this problem is when fault-tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output from component A, then fault-tolerance in B can hide a problem with A. If component B is later changed (to a less fault-tolerant design) the system may fail suddenly, making it appear that the new component B is the problem. Only after the system has been carefully scrutinized will it become clear that the root problem is actually with component A.

  • Reduction of priority of fault correction. Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the importance of repairing the fault. If the faults are not corrected, this will eventually lead to system failure, when the fault-tolerant component fails completely or when all redundant components have also failed.

  • Test difficulty. For certain critical fault-tolerant systems, such as a nuclear reactor, there is no easy way to verify that the backup components are functional. The most infamous example of this is Chernobyl, where operators ran a test of backup power for the coolant pumps with the emergency core cooling system disabled. The test went out of control, resulting in the destruction of the reactor core and a massive release of radiation.

  • Cost. Both fault-tolerant components and redundant components tend to increase cost. This can be a purely economic cost or can include other measures, such as weight. Manned spaceships, for example, have so many redundant and fault-tolerant components that their weight is increased dramatically over unmanned systems, which don't require the same level of safety.

  • Inferior components. A fault-tolerant design may allow for the use of inferior components, which would have otherwise made the system inoperable. While this practice has the potential to mitigate the cost increase, use of multiple inferior components may lower the reliability of the system to a level equal to, or even worse than, a comparable non-fault-tolerant system.
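
The last point can be illustrated with a small calculation. The sketch below assumes that redundant components fail independently and that the system works as long as at least one copy works; real systems often violate the independence assumption. The function name and the reliability figures are hypothetical.

```python
def parallel_reliability(component_reliability: float, copies: int) -> float:
    """Reliability of `copies` redundant components, any one of which suffices.

    Assumes independent failures (an assumption of this sketch).
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_reliability) ** copies


single_good = parallel_reliability(0.99, 1)  # one high-quality component
pair_of_90 = parallel_reliability(0.90, 2)   # two cheaper components
pair_of_85 = parallel_reliability(0.85, 2)   # two still cheaper components

print(f"single 0.99 component: {single_good:.4f}")
print(f"two 0.90 components:   {pair_of_90:.4f}")
print(f"two 0.85 components:   {pair_of_85:.4f}")
```

With these figures, two 0.90 components only match the single 0.99 component, and two 0.85 components fall below it, so the redundant design buys no reliability despite its added complexity.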

Related terms

There is a difference between fault-tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours of downtime per forty years, and therefore were highly fault resistant. But when a fault did occur they still stopped operating completely, and therefore were not fault-tolerant.

See also

  • Error-tolerant design
  • Capillary routing
  • Fail-safe
  • Fail soft
  • Fault-tolerant system
  • Graceful degradation
  • Safe-life design
  • Separation of protection and security


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.