Fencing (computing)
Fencing is the process of isolating a node of a computer cluster when that node appears to be malfunctioning. Isolating a node means ensuring that I/O can no longer be done from it. Fencing is typically done automatically by cluster infrastructure, such as shared disk file systems, in order to protect processes from other active nodes modifying the resources during node failures. Mechanisms to support fencing, such as the reserve/release mechanism of SCSI, have existed since at least 1985.

Fencing is required because the rest of the cluster cannot distinguish between a real failure and a temporary hang. If the malfunctioning node is really down, then it cannot do any damage, so in theory no action would be required (it could simply be brought back into the cluster with the usual join process). However, a malfunctioning node could itself conclude that it is the rest of the cluster that is malfunctioning; a race condition could then ensue and cause data corruption. The system therefore has to assume the worst case and always fence when a problem is detected.
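
The "always fence first" rule can be illustrated with a small toy model. This is only a sketch: the `SharedDisk` class, the `handle_suspected_failure` function, and the node names are hypothetical and do not correspond to any real cluster software. It assumes the storage itself can reject I/O from a fenced node.

```python
class SharedDisk:
    """Toy shared storage that enforces fencing on the storage side."""

    def __init__(self):
        self.fenced = set()   # names of nodes whose I/O is rejected
        self.blocks = []      # accepted writes, in order

    def fence(self, node):
        self.fenced.add(node)

    def write(self, node, data):
        if node in self.fenced:
            raise PermissionError(f"node {node!r} is fenced")
        self.blocks.append((node, data))


def handle_suspected_failure(disk, suspect, survivor):
    # The survivor cannot know whether the suspect is dead or only hung,
    # so it fences first and only then touches the shared data.
    disk.fence(suspect)
    disk.write(survivor, "recovery")


disk = SharedDisk()
disk.write("a", "update 1")
handle_suspected_failure(disk, suspect="a", survivor="b")

# Node "a" was only hung; it wakes up and retries a stale write.
try:
    disk.write("a", "stale update")
    stale_write_blocked = False
except PermissionError:
    stale_write_blocked = True
```

In this model the stale write is rejected whether node "a" had crashed or had merely stalled, which is why the survivor does not need to tell the two cases apart.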

Fencing methods include:
  • STONITH, which stands for "Shoot The Other Node In The Head", meaning that the server is automatically powered off
  • reserve/release ('R/R')
  • persistent reservation (SCSI3)
  • SAN fabric fencing, which is widely used by both Red Hat Global File System (GFS) and the PolyServe File System (PSFS)

Reserve/release by its nature only works with two-node clusters, because one of the two nodes in the cluster, upon detecting that the other node has 'failed', will issue the reserve and grab all the disks for itself. If the other node was only temporarily hung and later tries to do I/O, that I/O fails, and the failure triggers code that kills the node. In general, for two-node clusters, R/R is also sufficient to address the split-brain issue.
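
The two-node behavior described above can be sketched as a toy simulation. The classes `ReservableDisk` and `ClusterNode` and the exception `ReservationConflict` are hypothetical names chosen for illustration; this is not the SCSI command set, and it assumes that no reservation is held during normal operation so that the survivor's reserve succeeds.

```python
class ReservationConflict(Exception):
    """Raised when I/O comes from a node that does not hold the reservation."""


class ReservableDisk:
    def __init__(self):
        self.holder = None    # node currently holding the reservation
        self.blocks = []      # accepted writes, in order

    def reserve(self, node):
        if self.holder not in (None, node):
            raise ReservationConflict(f"already reserved by {self.holder!r}")
        self.holder = node

    def write(self, node, data):
        if self.holder not in (None, node):
            raise ReservationConflict(f"reserved by {self.holder!r}")
        self.blocks.append((node, data))


class ClusterNode:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.alive = True

    def do_io(self, data):
        if not self.alive:
            return
        try:
            self.disk.write(self.name, data)
        except ReservationConflict:
            # Stands in for the code that kills the node on I/O failure.
            self.alive = False


disk = ReservableDisk()
a = ClusterNode("a", disk)
b = ClusterNode("b", disk)

a.do_io("a1")
b.do_io("b1")
disk.reserve("b")     # b decides that a has failed and grabs the disk
a.do_io("stale")      # a was only hung; its I/O now fails and it stops
b.do_io("b2")
```

After this sequence only node "b" is still running, and the write that node "a" attempted after its hang never reaches the disk.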

For clusters of more than two nodes, R/R fencing does not work very well, because it would cause every node but one to shut itself down. In those cases persistent reservation is used. Persistent reservation is essentially a match on a key: a node that has the right key can do I/O, and the I/O of any other node fails. It is therefore sufficient to change the key when a node fails to ensure the right behavior.
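
The key-matching idea can be sketched with a three-node toy model. This follows the simplified description above rather than the actual SCSI-3 persistent reservation commands; `KeyedDisk`, `fence`, and the use of an integer as the key are assumptions made for illustration.

```python
class KeyedDisk:
    """Toy disk that accepts I/O only when the caller presents the current key."""

    def __init__(self, key):
        self.current_key = key
        self.blocks = []      # accepted writes, in order

    def change_key(self, new_key):
        self.current_key = new_key

    def write(self, node, key, data):
        if key != self.current_key:
            raise PermissionError(f"node {node!r} presented a stale key")
        self.blocks.append((node, data))


def fence(disk, node_keys, failed):
    # Change the key on the disk and hand the new key to every node
    # except the failed one, which keeps its old (now useless) key.
    new_key = disk.current_key + 1
    disk.change_key(new_key)
    for node in node_keys:
        if node != failed:
            node_keys[node] = new_key


disk = KeyedDisk(key=1)
node_keys = {"a": 1, "b": 1, "c": 1}

fence(disk, node_keys, failed="c")

results = {}
for node in ("a", "b", "c"):
    try:
        disk.write(node, node_keys[node], f"data from {node}")
        results[node] = True
    except PermissionError:
        results[node] = False
```

Unlike the reserve/release model, both surviving nodes keep their access here, and only the I/O of node "c" fails.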

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 