Recovery-oriented computing
Recovery-oriented computing (sometimes abbreviated to ROC) is an approach developed at Stanford University and the University of California, Berkeley for building reliable Internet services. Its proponents accept computer bugs as inevitable and aim to reduce their harmful effects. The National Science Foundation funds the project.

Several characteristics set recovery-oriented computing apart from other failure-handling techniques.

Isolation and redundancy

Isolation in these types of systems requires redundancy: should one part of the system fail, a redundant part must take its place. Isolation must hold for all types of failure, whether caused by software or by humans. One potential way to isolate parts of a system is to use virtual machine monitors such as Xen. Virtual machine monitors allow many virtual machines to run on one physical machine. If one virtual machine has a problem, it can be restarted without restarting the physical machine, or it can be stopped and another can take its place.
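The restart-or-replace behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Replica`, `Supervisor`, and `max_restarts` are hypothetical names, and the replica is a plain object standing in for a virtual machine rather than a call into Xen or any real hypervisor API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for an isolated unit such as a virtual machine."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Restarting one replica does not touch anything else on the host.
        self.restarts += 1
        self.healthy = True


class Supervisor:
    """Keeps one active replica and a queue of redundant standbys."""

    def __init__(self, active: Replica, standbys: list[Replica],
                 max_restarts: int = 1) -> None:
        self.active = active
        self.standbys = list(standbys)
        self.max_restarts = max_restarts  # assumed policy knob

    def check(self) -> Replica:
        """Return a healthy replica, restarting or replacing the active one."""
        if self.active.healthy:
            return self.active
        if self.active.restarts < self.max_restarts:
            self.active.restart()          # first try: restart in place
            return self.active
        if not self.standbys:
            raise RuntimeError("no redundant replica left")
        self.active = self.standbys.pop(0)  # otherwise: fail over
        return self.active
```

A caller would mark a replica unhealthy when a health check fails and then call `check()`; the first failure triggers a restart, and a repeated failure hands the work to a standby.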

System-wide undo support

The ability to undo across different programs and time frames is necessary in this type of system because human error is a major cause of system failures. People are naturally inclined to work by trial and error, so a production system without undo support is also harder to test.

System-wide undo support should cover all aspects of the system, including hardware and software upgrades, configuration, and application management. There are limits to what can be undone; these limits are being explored, tested, and rated according to their tradeoffs.
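One common way to make changes undoable is to record an inverse action for each change. The following is a minimal sketch of that idea applied to configuration changes only; `UndoLog` and `set_option` are hypothetical names, and a real system-wide undo facility would also have to deal with the limits mentioned above, such as actions whose effects have already left the system.

```python
from typing import Callable


class UndoLog:
    """Records an inverse action for every change so it can be rolled back."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Callable[[], None]]] = []

    def record(self, description: str, inverse: Callable[[], None]) -> None:
        self._entries.append((description, inverse))

    def undo(self, steps: int = 1) -> list[str]:
        """Undo the most recent changes, newest first; return what was undone."""
        undone = []
        for _ in range(min(steps, len(self._entries))):
            description, inverse = self._entries.pop()
            inverse()
            undone.append(description)
        return undone


def set_option(config: dict, log: UndoLog, key: str, value: object) -> None:
    """Change one configuration value and log how to reverse the change."""
    missing = object()                      # sentinel: key did not exist before
    previous = config.get(key, missing)

    def inverse() -> None:
        if previous is missing:
            config.pop(key, None)
        else:
            config[key] = previous

    log.record(f"set {key}", inverse)
    config[key] = value
```

Because every change goes through the log, an operator can try a setting on a live system and step back if it misbehaves.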

Integrated diagnostic support

Integrated diagnostic support is another characteristic a recovery-oriented computer should have. The system should be able to identify the root cause of a failure, and then either contain the failure so that it cannot affect other parts of the system or repair it. All system components or modules should be self-testing: each should be able to detect when something is wrong with itself. As well as detecting their own problems, modules should be able to verify the behavior of the other modules they depend on. The system must also track module, resource, and user request dependencies throughout the system, which allows failures to be contained.
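Dependency tracking supports containment because it tells the system which modules a failure can reach. As a rough sketch, assuming dependencies are kept as a simple mapping from each module to the modules it depends on (the function name `affected_by` and the data layout are illustrative assumptions), the set of modules exposed to a failure can be found with a breadth-first search:

```python
from collections import deque


def affected_by(failed: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every module that directly or transitively depends on `failed`."""
    # Invert the mapping: dependency -> modules that use it directly.
    dependents: dict[str, set[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, ()):
            if user not in affected:
                affected.add(user)
                queue.append(user)
    affected.discard(failed)  # only relevant if the graph contains a cycle
    return affected
```

For example, with `web` depending on `auth` and `catalog`, and both of those depending on `db`, a failure in `db` exposes all three, so they are the modules to fence off or re-verify.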

Online verification and recovery mechanisms

Recovery mechanisms are the ways in which a system recovers from failures. They should be well designed, meaning reliable, effective, and efficient. The system should proactively test and verify its recovery mechanisms, so that when a real failure occurs the mechanisms can be trusted to do what they were designed to do. These verifications should be performed even on production equipment, since that is the equipment whose uptime matters most. There are two methods for performing these tests, and both should be used. The first is directed tests, which are deliberately set up and executed. The second is random tests, which occur without warning.
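The two testing styles can be illustrated with a toy model. Everything here is an assumption made for the example: `Service` is a stand-in with a trivial `recover` routine, and faults are simulated by flipping a status flag rather than by breaking real equipment.

```python
import random


class Service:
    """Toy service with named components and a recovery routine."""

    def __init__(self, components: list[str]) -> None:
        self.status = {name: "up" for name in components}

    def inject_fault(self, name: str) -> None:
        self.status[name] = "down"

    def recover(self, name: str) -> None:
        # The recovery mechanism under test: bring the component back up.
        self.status[name] = "up"

    def is_healthy(self) -> bool:
        return all(state == "up" for state in self.status.values())


def verify_recovery(service: Service, target: str) -> bool:
    """Inject one fault, run recovery, and report whether the service is healthy."""
    service.inject_fault(target)
    service.recover(target)
    return service.is_healthy()


def directed_tests(service: Service, targets: list[str]) -> dict[str, bool]:
    """Directed testing: the operator chooses which components to fail."""
    return {target: verify_recovery(service, target) for target in targets}


def random_tests(service: Service, rounds: int,
                 rng: random.Random) -> list[tuple[str, bool]]:
    """Random testing: the component to fail is picked without warning."""
    results = []
    for _ in range(rounds):
        target = rng.choice(sorted(service.status))
        results.append((target, verify_recovery(service, target)))
    return results
```

Passing an explicit `random.Random` instance keeps a random run reproducible when a failing sequence needs to be replayed.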

Modularity, measurability and restartability

Software aging problems are best resolved by restarting the affected component, which requires both modularity and restartability. Components should be restarted before they fail, and should be designed so that a restart can be triggered on demand or, better yet, happens automatically. Applications should also be designed for restartability.
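Restarting a component before it fails can be as simple as watching a resource that grows with age and restarting once it crosses a threshold. The sketch below simulates this with a memory leak; the `Component` class, the leak rate, and the `limit_mb` threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

BASELINE_MB = 100.0  # assumed memory use of a freshly started component


@dataclass
class Component:
    name: str
    memory_mb: float = BASELINE_MB
    leak_mb_per_request: float = 0.5   # simulated aging
    restarts: int = 0

    def handle_request(self) -> None:
        # Each request leaks a little memory, so the component ages with use.
        self.memory_mb += self.leak_mb_per_request

    def restart(self) -> None:
        # A restart returns the component to its fresh state.
        self.memory_mb = BASELINE_MB
        self.restarts += 1


def rejuvenate_if_needed(component: Component, limit_mb: float) -> bool:
    """Restart the component proactively once it reaches the limit."""
    if component.memory_mb >= limit_mb:
        component.restart()
        return True
    return False
```

Calling `rejuvenate_if_needed` after each request, or from a periodic timer, keeps the component from ever running into the resource exhaustion that would otherwise crash it.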

Benchmarks

These systems should undergo frequent dependability and availability benchmarking, both to track their progress and to justify their use. The benchmarks should be reproducible and should give an impartial measure of system dependability, reliability, and availability.
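As one possible starting point for such a benchmark, availability can be computed from an outage log. The sketch below uses the standard definitions of availability (uptime divided by observed time), mean time to repair, and mean time to failure; the function name and the choice of these three metrics are assumptions for illustration, not a benchmark defined by the ROC project.

```python
def availability_report(observed_hours: float,
                        outage_hours: list[float]) -> dict[str, float]:
    """Summarise an outage log as availability, MTTR, and MTTF (in hours)."""
    downtime = sum(outage_hours)
    uptime = observed_hours - downtime
    failures = len(outage_hours)
    return {
        "availability": uptime / observed_hours,
        # Mean time to repair: average length of an outage.
        "mttr_hours": downtime / failures if failures else 0.0,
        # Mean time to failure: average uptime per failure.
        "mttf_hours": uptime / failures if failures else float("inf"),
    }
```

Running the same calculation over the same log always yields the same figures, which is the reproducibility property asked for above. With these definitions availability equals MTTF / (MTTF + MTTR), so shortening recovery time raises availability even when failures are no rarer.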

External links

  • The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project, the official web site, which to date includes information on research, people, publications, talks, retreats, and projects
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 