Downtime
The term downtime is used to refer to periods when a system is unavailable.
Downtime or outage duration refers to a period of time that a system fails to provide or perform its primary function. Reliability, availability, recovery, and unavailability are related concepts.
The unavailability is the proportion of a timespan that a system is unavailable or offline.
This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance.

The term is commonly applied to networks and servers. The common reasons for unplanned outages are system failures (such as a crash) or communications failures (commonly known as network outage).

The term is also commonly applied in industrial environments in relation to failures in industrial production equipment. Some facilities measure the downtime incurred during a work shift, or during a 12- or 24-hour period. Another common practice is to classify each downtime event as having an operational, electrical or mechanical origin.

The opposite of downtime is uptime.

Characteristics

Unplanned downtime may be the result of a software bug, human error, equipment failure, malfunction, high bit error rate, power failure, overload due to exceeding the channel capacity, or a cascading failure, among other causes.

Telecommunication outage classifications

Downtime can be caused by failure in
hardware (physical equipment),
software (logic controlling equipment),
interconnecting equipment (such as cables, facilities, routers,...),
wireless transmission (wireless, microwave, satellite), and/or
capacity (system limits).

The failures can occur because of
damage,
failure,
design,
procedural (improper use by humans),
engineering (how the system is used and deployed),
overload (traffic or system resources stressed beyond designed limits),
environment (support systems like power and HVAC),
scheduled downtime (outages designed into the system for a purpose such as software upgrades and equipment growth),
other (none of the above but known), or
unknown.

The failures can be the responsibility of
customer/service provider,
vendor/supplier,
utility,
government,
contractor,
end customer,
public individual,
act of nature,
other (none of the above but known), or
unknown.
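
The three lists above can be read as three independent axes on which each outage is recorded: what failed, why it failed, and who was responsible. The following is only an illustrative sketch of such a record. The names `Element`, `Cause`, `Responsible`, `Outage` and `minutes_by` are hypothetical, and the sample log is made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Element(Enum):  # what failed
    HARDWARE = "hardware"
    SOFTWARE = "software"
    INTERCONNECT = "interconnecting equipment"
    WIRELESS = "wireless transmission"
    CAPACITY = "capacity"


class Cause(Enum):  # why it failed
    DAMAGE = "damage"
    FAILURE = "failure"
    DESIGN = "design"
    PROCEDURAL = "procedural"
    ENGINEERING = "engineering"
    OVERLOAD = "overload"
    ENVIRONMENT = "environment"
    SCHEDULED = "scheduled downtime"
    OTHER = "other"
    UNKNOWN = "unknown"


class Responsible(Enum):  # who was responsible
    SERVICE_PROVIDER = "customer/service provider"
    VENDOR = "vendor/supplier"
    UTILITY = "utility"
    GOVERNMENT = "government"
    CONTRACTOR = "contractor"
    END_CUSTOMER = "end customer"
    PUBLIC = "public individual"
    NATURE = "act of nature"
    OTHER = "other"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Outage:
    element: Element
    cause: Cause
    responsible: Responsible
    minutes: float  # outage duration


def minutes_by(outages, axis):
    """Total outage minutes grouped by one axis: 'element', 'cause' or 'responsible'."""
    totals = defaultdict(float)
    for outage in outages:
        totals[getattr(outage, axis)] += outage.minutes
    return dict(totals)


# Made-up sample log, for illustration only.
log = [
    Outage(Element.HARDWARE, Cause.ENVIRONMENT, Responsible.UTILITY, 42.0),
    Outage(Element.SOFTWARE, Cause.SCHEDULED, Responsible.SERVICE_PROVIDER, 30.0),
    Outage(Element.HARDWARE, Cause.DAMAGE, Responsible.CONTRACTOR, 18.0),
]
print(minutes_by(log, "element"))
```

Keeping the axes separate means the same log can be summarised by failed element, by cause, or by responsible party without re-entering any data.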

Impact

Outages caused by system failures can have a serious impact on the users of computer/network systems, in particular those industries that rely on a nearly 24-hour service:
  • medical informatics
  • nuclear power and other infrastructure
  • banks and other financial institutions
  • aeronautics, airlines
  • news reporting
  • e-commerce and online transaction processing
  • persistent online games


Users of an ISP and other customers of a telecommunication network can also be affected.

Corporations can lose business due to network outage or they may default on a contract, resulting in financial losses.

Those people or organizations that are affected by downtime can be more sensitive to particular aspects:
  • some are more affected by the length of an outage - it matters to them how much time it takes to recover from a problem
  • others are sensitive to the timing of an outage - outages during peak hours affect them the most


The most demanding users are those that require high availability.

Famous outages

On Mother's Day, Sunday, May 8, 1988, a fire broke out in the main switching room of the Hinsdale Central Office of the Illinois Bell telephone company. One of the largest switching systems in the state, the facility processed more than 3.5 million calls each day while serving 38,000 customers, including numerous businesses, hospitals, and Chicago’s O’Hare and Midway Airports.

Virtually the entire AT&T network of 4ESS toll tandem switches went in and out of service over and over again on Jan. 15, 1990, disrupting long distance service for the entire nation. The problem dissipated by itself when traffic slowed down. A software bug was found to be the cause.

AT&T lost its frame relay network for 26 hours on April 13, 1998. This affected many thousands of customers, and bank transactions were one casualty. AT&T failed to meet the service level agreement in its contracts with customers and had to refund 6600 customer accounts, costing millions of dollars.

Xbox Live had intermittent downtime lasting thirteen days during the 2007-2008 holiday season. Increased demand from Xbox 360 purchasers (the largest number of new user sign-ups in the history of Xbox Live) was given as the reason for the downtime; to make amends for the service issues, Microsoft offered its users the opportunity to receive a free game.

Sony's PlayStation Network outage of April 2011 began on April 20, 2011, and service was gradually restored from May 14, 2011, starting in the United States. This outage was the longest period the PSN had been offline since its inception in 2006. Sony stated that the problem was caused by an external intrusion which resulted in the theft of personal information. Sony reported on April 26, 2011 that a large amount of user data had been obtained by the same hack that resulted in the downtime.

Service levels

In service level agreements, it is common to mention a percentage value (per month or per year) that is calculated by dividing the sum of all downtime timespans by the total time of a reference time span (e.g. a month). 0% downtime means that the server was available all the time.

For Internet servers, downtime above 1% per year can be regarded as unacceptable, as this means more than 3 days of downtime per year (1% of 365 days is 3.65 days). For e-commerce and other industrial use, any value above 0.1% (about 8.8 hours per year) is usually considered unacceptable.
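
The percentage described above, and its conversion back into an amount of time, can be illustrated with a short calculation. This is a minimal sketch rather than any standard SLA formula: the function names are hypothetical, a year is assumed to have 365 days, and outages are assumed not to overlap.

```python
from datetime import datetime, timedelta


def downtime_percent(outages, period_start, period_end):
    """Downtime as a percentage of the reference period.

    `outages` is a list of (start, end) datetime pairs. Each outage is
    clipped to the reference period; outages are assumed not to overlap.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * down / period


def allowed_downtime_per_year(percent, days_per_year=365):
    """Length of time that `percent` downtime represents over one year."""
    return timedelta(days=days_per_year) * (percent / 100.0)


# A 30-day month containing a single outage of 7 hours 12 minutes.
month_start = datetime(2024, 4, 1)
month_end = datetime(2024, 5, 1)
outages = [(datetime(2024, 4, 10, 0, 0), datetime(2024, 4, 10, 7, 12))]

print(downtime_percent(outages, month_start, month_end))
print(allowed_downtime_per_year(1))    # the 1% threshold
print(allowed_downtime_per_year(0.1))  # the 0.1% threshold
```

The example month has 720 hours, so an outage of 7.2 hours corresponds to 1% downtime for that month.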

Response and reduction of impact

It is the duty of the network designer to minimise the chance of a network outage. When an outage does happen, a well-designed system will reduce its effects by keeping outages localized, so that they can be detected and fixed as soon as possible.

A process needs to be in place to detect a malfunction (network monitoring) and to restore the network to a working condition. Restoration generally involves a help desk team of trained engineers that can troubleshoot the problem; a separate help desk team is usually necessary to field user input, which can be particularly demanding during downtime.

A network management system can be used to detect faulty or degrading components before customers complain, allowing proactive fault rectification.

Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance, such as upgrades, by using redundant systems or by having a contingency plan or business continuity plan.
Technical means such as error-correcting codes, retransmission, checksums, or diversity schemes can reduce errors.
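
As an illustration of two of the technical means named above, the sketch below combines a checksum with retransmission: the sender appends a CRC-32 checksum to each message and resends it until the receiver's check succeeds. This is a simplified model, not a real network protocol. The function names and the `flaky_channel` test channel are hypothetical, and the only library call used is `zlib.crc32` from the Python standard library.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    payload, received = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received:
        return payload
    return None


def send_with_retransmission(payload, channel, max_attempts=5):
    """Resend until the checksum verifies; return (payload, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        received = unframe(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise ConnectionError("message could not be delivered intact")


def flaky_channel(corrupt_first_n):
    """Hypothetical channel that flips one bit in its first n transmissions."""
    state = {"sent": 0}

    def channel(data: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            return bytes([data[0] ^ 0x01]) + data[1:]  # flip lowest bit of first byte
        return data

    return channel


message, attempts = send_with_retransmission(b"hello", flaky_channel(2))
print(message, attempts)
```

A checksum only detects corruption; it is the retransmission that repairs it. An error-correcting code, by contrast, lets the receiver repair some errors without asking for the data again.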

Planning

A planned outage is the result of a planned activity by the system owner and/or by a service provider. These outages, often scheduled during the maintenance window, can be used to perform tasks including the following:
  • Deferred maintenance, e.g., a deferred hardware repair or a deferred restart to clean up garbled memory
  • Diagnostics to isolate a detected fault
  • Hardware fault repair
  • Fixing an error or omission in a configuration database or in a recent configuration database change
  • Fixing an error in an application database or in a recent application database change
  • Software patching/software updates to fix a software fault


Outages can also be planned as a result of a predictable natural event, such as a sun outage.

Maintenance downtimes have to be carefully scheduled in industries that rely on computer systems. In many cases, system-wide downtimes can be averted using what is called a "rolling upgrade" - the process of incrementally taking down parts of the system for upgrade, without affecting the overall functionality.
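
The idea of a rolling upgrade can be sketched as a loop that takes one node out of service at a time and refuses to proceed if doing so would leave too few nodes serving. This is a toy model under stated assumptions: the `Node` class, the `rolling_upgrade` function and the `min_in_service` threshold are hypothetical, and the actual upgrade step is reduced to changing a version string.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_service: bool = True


def rolling_upgrade(nodes, new_version, min_in_service):
    """Upgrade nodes one at a time, keeping at least `min_in_service` serving.

    Returns the lowest number of in-service nodes seen during the upgrade.
    """
    lowest = sum(n.in_service for n in nodes)
    for node in nodes:
        serving = sum(n.in_service for n in nodes)
        if node.in_service and serving - 1 < min_in_service:
            raise RuntimeError("not enough spare capacity for a rolling upgrade")
        node.in_service = False      # drain the node
        lowest = min(lowest, sum(n.in_service for n in nodes))
        node.version = new_version   # stand-in for the real upgrade step
        node.in_service = True       # return the node to the pool
    return lowest


cluster = [Node("a", "1.0"), Node("b", "1.0"), Node("c", "1.0")]
lowest = rolling_upgrade(cluster, "1.1", min_in_service=2)
print(lowest, [n.version for n in cluster])
```

The check before each step is what distinguishes a rolling upgrade from a system-wide outage: with only one node, or with no spare capacity, the upgrade cannot be performed without downtime.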

Avoidance

For most websites, website monitoring is available. Website monitoring (synthetic or passive) is a service that tracks downtime and user activity on the site.
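
A synthetic monitor works by probing the site at regular intervals and recording when the probe fails. The sketch below shows how a series of such checks could be turned into downtime intervals. It is an assumption-laden illustration: the function names are hypothetical, and the probe is passed in as a plain function so that the example runs without network access. A real monitor would replace it with an actual request to the site.

```python
def run_checks(probe, timestamps):
    """Call `probe(ts)` at each timestamp; an exception counts as 'down'."""
    results = []
    for ts in timestamps:
        try:
            is_up = bool(probe(ts))
        except Exception:
            is_up = False
        results.append((ts, is_up))
    return results


def downtime_intervals(results):
    """Turn time-ordered (timestamp, is_up) checks into (start, end) outages.

    An outage runs from the first failed check to the next successful one.
    An outage still open at the end is closed at the last check's timestamp.
    """
    outages = []
    down_since = None
    for ts, is_up in results:
        if not is_up and down_since is None:
            down_since = ts
        elif is_up and down_since is not None:
            outages.append((down_since, ts))
            down_since = None
    if down_since is not None:
        outages.append((down_since, results[-1][0]))
    return outages


def fake_probe(ts):
    """Stand-in probe: the site is 'down' between seconds 120 and 300."""
    return not (120 <= ts < 300)


checks = run_checks(fake_probe, range(0, 600, 60))  # one check per minute
print(downtime_intervals(checks))
```

Because checks are taken at intervals, the measured outage can differ from the real one by up to one check interval at each end, which is why the check frequency matters when downtime is reported.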

Other usage

Downtime can also refer to time when human capital or other assets go down. For instance, if employees are in meetings or unable to perform their work due to another constraint, they are down. This can be equally expensive, and can be the result of another asset (e.g. computers or systems) being down. This is also commonly known as "dead time".

The term is also used in factories and other industrial settings. See total productive maintenance (TPM).

Measuring Downtime

There are many external services that can be used to monitor the uptime, downtime, and availability of a service or a host. Some examples:
  • Pingdom
  • Watchmouse

See also

  • High availability
  • Uptime
  • Mean down time
  • Planned downtime
  • Carrier grade

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 