High availability
Encyclopedia
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

Users want their systems, for example wrist watches, hospitals, airplanes or computers, to be ready to serve them at all times. Availability
Availability
In telecommunications and reliability theory, the term availability has the following meanings:* The degree to which a system, subsystem, or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, i.e., a random, time...

 refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime
Downtime
The term downtime is used to refer to periods when a system is unavailable.Downtime or outage duration refers to a period of time that a system fails to provide or perform its primary function...

is used to refer to periods when a system is unavailable.

Scheduled and Unscheduled Downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software
System software
System software is computer software designed to operate the computer hardware and to provide a platform for running application software.The most basic types of system software are:...

 that require a reboot
Booting
In computing, booting is a process that begins when a user turns on a computer system and prepares the computer to perform its normal operations. On modern computers, this typically involves loading and starting an operating system. The boot sequence is the initial set of operations that the...

 or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM
Ram
-Animals:*Ram, an uncastrated male sheep*Ram cichlid, a species of freshwater fish endemic to Colombia and Venezuela-Military:*Battering ram*Ramming, a military tactic in which one vehicle runs into another...

 components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, catastrophic security breaches, or various application
Application software
Application software, also known as an application or an "app", is computer software designed to help the user to perform specific tasks. Examples include enterprise software, accounting software, office suites, graphics software and media players. Many application programs deal principally with...

, middleware
Middleware
Middleware is computer software that connects software components or people and their applications. The software consists of a set of services that allows multiple processes running on one or more machines to interact...

, and operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

 failures.

Many computing sites exclude scheduled downtime from availability calculations, assuming, correctly or incorrectly, that scheduled downtime has little or no impact upon the computing user community. By excluding scheduled downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure
Single point of failure
A single point of failure is a part of a system that, if it fails, will stop the entire system from working. They are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.-Overview:Systems can be made...

 and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreement
Service Level Agreement
A service-level agreement is a part of a service contract where the level of service is formally defined. In practice, the term SLA is sometimes used to refer to the contracted delivery time or performance...

s often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable per year, month, or week.
Uptime
Uptime
Uptime is a measure of the time a machine has been up without any downtime.It is often used as a measure of computer operating system reliability or stability, in that this time represents the time a computer can be left unattended without crashing, or needing to be rebooted for administrative or...

 and availability
Availability
In telecommunications and reliability theory, the term availability has the following meanings:* The degree to which a system, subsystem, or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, i.e., a random, time...

 are not synonymous. A system can be up, but not available, as in the case of a network outage.

In general, the number of nines is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formula. More often, the unavailability expressed as a probability
Probability
Probability is ordinarily used to describe an attitude of mind towards some proposition of whose truth we arenot certain. The proposition of interest is usually of the form "Will a specific event occur?" The attitude of mind is of the form "How certain are we that the event will occur?" The...

 (like 0.00001), or a downtime
Downtime
The term downtime is used to refer to periods when a system is unavailable.Downtime or outage duration refers to a period of time that a system fails to provide or perform its primary function...

 per year is quoted. Availability specified as a number of nines is often seen in marketing
Marketing
Marketing is the process used to determine what products or services may be of interest to customers, and the strategy to use in sales, communications and business development. It generates the strategy that underlies sales techniques, business communication, and business developments...

 documents.

The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.

Measurement and interpretation

Clearly, how availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users — a true availability measure is holistic.

Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.

Closely related concepts

Recovery time (or estimated time of repair (ETR)) is closely related to availability, that is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery
Disaster recovery
Disaster recovery is the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster. Disaster recovery is a subset of business continuity...

 data center.

Another related concept is data availability, that is the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement
Service Level Agreement
A service-level agreement is a part of a service contract where the level of service is formally defined. In practice, the term SLA is sometimes used to refer to the contracted delivery time or performance...

 ("SLA") formalizes an organization's availability objectives and requirements.

System design for high availability

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. That is because complex systems inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high quality, multi-purpose physical system with comprehensive internal hardware redundancy); however, this architecture suffers from the requirement that the entire system must be brought down for patching and Operating System upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing
Load balancing (computing)
Load balancing is a computer networking methodology to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid...

 and failover
Failover
In computing, failover is automatic switching to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active application, server, system, or network...

).

High availability implies no human intervention to restore operation in complex systems. For example, availability limit of 99.999% allows about one second of down time per day, which is impractical using human labor. The need for human intervention for maintenance actions in a large system will exceed this limit. Availability limit of 99% would allow an average of 15 minutes per day, which is realistic for human intervention.

Redundancy (engineering)
Redundancy (engineering)
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe....

 is used to eliminate the need for human intervention. The two kinds of redundancy are passive redundancy and active redundancy.

Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission
Electric power transmission
Electric-power transmission is the bulk transfer of electrical energy, from generating power plants to Electrical substations located near demand centers...

. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet routing
Routing
Routing is the process of selecting paths in a network along which to send network traffic. Routing is performed for many kinds of networks, including the telephone network , electronic data networks , and transportation networks...

 is derived from early work by Birman and Joseph in this area. Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.

Zero downtime system design means that modeling and simulation indicates mean time between failures significantly exceeds the period of time between planned maintenance, upgrade
Upgrade
The term upgrade refers to the replacement of a product with a newer version of the same product. It is most often used in computing and consumer electronics, generally meaning a replacement of hardware, software or firmware with a newer or better version, in order to bring the system up to date...

 events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellite
Communications satellite
A communications satellite is an artificial satellite stationed in space for the purpose of telecommunications...

. Global Positioning System
Global Positioning System
The Global Positioning System is a space-based global navigation satellite system that provides location and time information in all weather, anywhere on or near the Earth, where there is an unobstructed line of sight to four or more GPS satellites...

 is an example of a zero downtime system.

Fault instrumentation
Instrumentation
Instrumentation is defined as the art and science of measurement and control of process variables within a production, or manufacturing area....

 can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of down-time only after a fault indicator activates. Failure is only significant if this occurs during a mission critical
Mission Critical
Mission critical refers to any factor of a system whose failure will result in the failure of business operations. That is, it is critical to the organization's 'mission'....

 period. This strategy is called Condition-based maintenance, and this is only effective with active redundancy.

Modeling and simulation
Modeling and simulation
Modeling and simulation is the use of models, including emulators, prototypes, simulators, and stimulators, either statically or over time, to develop data as a basis for making managerial or technical decisions. The terms "modeling" and "simulation" are often used interchangeably.The use of...

 is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criteria. N represents the total number of components in the system. x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two component are faulted simultaneously.

Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems, from most to least important, as follows:
Causal factor of unavailability
Lack of best practice change control
Change Control
Change control within Quality management systems and Information Technology systems is a formal process used to ensure that changes to a product or system are introduced in a controlled and coordinated manner...

Lack of best practice monitoring of the relevant components
Lack of best practice requirements
Requirements management
Requirements management is the process of documenting, analyzing, tracing, prioritizing and agreeing on requirements and then controlling change and communicating to relevant stakeholders. It is a continuous process throughout a project...

 and procurement
Lack of best practice operations
Lack of best practice avoidance of network failures
Lack of best practice avoidance of internal application failures
Lack of best practice avoidance of external services that fail
Lack of best practice physical environment
Lack of best practice network redundancy
Lack of best practice technical solution of backup
Lack of best practice process solution of backup
Lack of best practice physical location
Lack of best practice infrastructure redundancy
Lack of best practice storage architecture redundancy


The factors themselves are based on the work of Evan Marcus & Hal Stern.

Costs of unavailability

In a 1998 report from IBM Global Services, unavailable systems are estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.

See also

  • Business continuity planning
    Business continuity planning
    Business continuity planning “identifies [an] organization's exposure to internal and external threats and synthesizes hard and soft assets to provide effective prevention and recovery for the organization, whilst maintaining competitive advantage and value system integrity”. It is also called...

  • Carrier grade
    Carrier grade
    In telecommunication, a "carrier grade" or "carrier class" refers to a system, or a hardware or software component that is extremely reliable, well tested and proven in its capabilities...

  • Disaster recovery
    Disaster recovery
    Disaster recovery is the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster. Disaster recovery is a subset of business continuity...

  • Fault-tolerant system
    Fault-tolerant system
    Fault-tolerance or graceful degradation is the property that enables a system to continue operating properly in the event of the failure of some of its components. A newer approach is progressive enhancement...

  • Reliability (computer networking)
    Reliability (computer networking)
    In computer networking, a reliable protocol is one that provides reliability properties with respect to the delivery of data to the intended recipient, as opposed to an unreliable protocol, which does not provide notifications to the sender as to the delivery of transmitted data.A reliable...

  • High-availability cluster
    High-availability cluster
    High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum of down-time. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail...

  • Service Availability Forum
    Service Availability Forum
    The Service Availability Forum is a consortium that develops, publishes, educates on and promotes open specifications for carrier-grade and mission-critical systems. SA Forum specifications enable faster development and deployment of commercial off-the-shelf ecosystems for highly available...

  • Uptime
    Uptime
    Uptime is a measure of the time a machine has been up without any downtime.It is often used as a measure of computer operating system reliability or stability, in that this time represents the time a computer can be left unattended without crashing, or needing to be rebooted for administrative or...


External links


The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK