Troubleshooting
Troubleshooting is a form of problem solving, often applied to repair failed products or processes. It is a logical, systematic search for the source of a problem so that it can be solved and the product or process can be made operational again. Troubleshooting is needed to develop and maintain complex systems where the symptoms of a problem can have many possible causes. It is used in many fields, such as engineering, system administration, electronics, automotive repair, and diagnostic medicine. Troubleshooting requires identification of the malfunction(s) or symptoms within a system. Then, experience is commonly used to generate possible causes of the symptoms. Determining which cause is most likely is often a process of elimination: eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

In general, troubleshooting is the identification or diagnosis of "trouble" in the management flow of a corporation or a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.

A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in a hardcopy emerging from some specific device.) Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example.)

The methods of forensic engineering are especially useful in tracing problems in products or processes, and a wide range of analytical techniques are available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failures of a similar kind. Preventative action is possible using failure mode and effects analysis (FMEA) and fault tree analysis (FTA) before full scale production, and these methods can also be used for failure analysis.

Aspects

Most discussion of troubleshooting, and especially training in formal troubleshooting procedures, tends to be domain specific, even though the basic principles are universally applicable.

Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there".) However, there is a well known principle that correlation does not imply causality. (For example, the failure of a device shortly after it's been plugged into a different outlet doesn't necessarily mean that the events were related. The failure could have been a matter of coincidence.) Therefore troubleshooting demands critical thinking rather than magical thinking.

It's useful to consider the common experiences we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of a bulb's filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear-and-tear of components in a system.

A basic principle in troubleshooting is to start with the simplest and most probable problems. This is illustrated by the old saying "When you see hoof prints, look for horses, not zebras", or, to use another maxim, the KISS principle. This principle results in the common complaint about help desks or manuals that they sometimes first ask: "Is it plugged in and does that receptacle have power?" This should not be taken as an affront; rather, it should serve as a reminder or conditioning to always check the simple things first before calling for help.

A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.

Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.

It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.

A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn't notice the incorrect usage, for instance if two parts have different functions but share a common case so that it isn't apparent on a casual inspection which part is being used.

Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take and about organizing them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.
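A flowchart prepared in advance can be captured as a small data structure. The sketch below is illustrative only: the questions, the node names and the `walk` helper are assumptions built around the printer example used in this article, not a procedure taken from any real manual.

```python
# Hypothetical troubleshooting flowchart for the printer example.
# Each node maps an answer ("yes" or "no") either to the name of the
# next node or to a plain-text recommendation (a leaf).
FLOWCHART = {
    "start": {
        "question": "Is the printer's power light on?",
        "yes": "cable",
        "no": "Check that the printer is plugged in and the receptacle has power.",
    },
    "cable": {
        "question": "Is the cable firmly seated at both ends?",
        "yes": "server",
        "no": "Reseat the cable at both ends.",
    },
    "server": {
        "question": "Did the job reach the print server?",
        "yes": "Investigate the path from the server to the printer.",
        "no": "Investigate the path from the user's computer to the server.",
    },
}


def walk(flowchart, answers, node="start"):
    """Follow the flowchart using a dict of node name -> 'yes'/'no' answers.

    Returns the recommendation reached once the path leaves the flowchart.
    """
    while node in flowchart:
        step = flowchart[node]
        node = step[answers[node]]
    return node


print(walk(FLOWCHART, {"start": "yes", "cable": "no"}))
```

Keeping the procedure as data rather than as nested conditionals makes it easier to review and reorder the steps before a problem occurs.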

Some computerized troubleshooting services (such as Primefax, later renamed Maxserve) immediately show the top 10 solutions with the highest probability of fixing the underlying problem. The technician can either answer additional questions to advance through the troubleshooting procedure, each step narrowing the list of solutions, or immediately implement the solution judged most likely to fix the problem. These services give a rebate if the technician takes an additional step after the problem is solved: reporting back the solution that actually fixed the problem. The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms.
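The feedback loop described above can be sketched with simple frequency counts. This is only one plausible approach, not the algorithm used by any named service; the `SolutionRanker` class, its method names and the sample symptoms are all hypothetical.

```python
from collections import Counter, defaultdict


class SolutionRanker:
    """Rank solutions by how often technicians reported that they worked."""

    def __init__(self):
        # Maps a set of symptoms to a tally of solutions reported as fixes.
        self._reports = defaultdict(Counter)

    def report_fix(self, symptoms, solution):
        """Record that `solution` fixed a problem showing these symptoms."""
        self._reports[frozenset(symptoms)][solution] += 1

    def top_solutions(self, symptoms, n=10):
        """Return up to n (solution, estimated probability) pairs, best first."""
        counts = self._reports.get(frozenset(symptoms), Counter())
        total = sum(counts.values())
        return [(solution, count / total)
                for solution, count in counts.most_common(n)]


ranker = SolutionRanker()
symptoms = {"paper jam", "smudged output"}
for _ in range(3):
    ranker.report_fix(symptoms, "replace fuser")
ranker.report_fix(symptoms, "clean rollers")
print(ranker.top_solutions(symptoms))
```

The estimate here is just the relative frequency of each reported fix, so it improves only as more technicians report back, which is presumably why such services reward that extra step.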

Half-splitting

Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes.

Two strategies are commonly used by troubleshooters. The first is to check for frequently encountered or easily tested conditions (for example, checking to ensure that a printer's light is on and that its cable is firmly seated at both ends). This is often referred to as "milking the front panel."

The second is to "bisect" the system (for example, in a network printing system, checking to see if the job reached the server to determine whether a problem exists in the subsystems "towards" the user's end or "towards" the device).

This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the range of dependencies and is often referred to as "half-splitting".
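Half-splitting can be sketched as a binary search over a serial chain of stages. The stage names and the `works_up_to` probe below are hypothetical; the sketch assumes a single fault, so that every probe upstream of the fault passes and every probe at or downstream of it fails.

```python
def half_split(stages, works_up_to):
    """Return the index of the first failing stage in a serial chain.

    `works_up_to(i)` is a caller-supplied probe that returns True when the
    output observed after stage i is still good. Assumes exactly one fault.
    """
    low, high = 0, len(stages) - 1
    while low < high:
        mid = (low + high) // 2
        if works_up_to(mid):
            low = mid + 1   # everything through mid is fine; look downstream
        else:
            high = mid      # the fault is at mid or upstream of it
    return low


# Hypothetical network printing chain with a fault at the server.
stages = ["application", "spooler", "network", "server", "driver", "printer"]
faulty = stages.index("server")
probes = []


def probe(i):
    probes.append(stages[i])
    return i < faulty


print(stages[half_split(stages, probe)], "found after", len(probes), "probes")
```

With six stages the fault is located in three probes, whereas checking stage by stage from the user's end would take four; the gap widens as the chain grows, since the number of probes grows only logarithmically.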

Reproducing symptoms

One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often considerable effort and emphasis in troubleshooting is placed on reproducibility, that is, on finding a procedure to reliably induce the symptom to occur.

Once this is done then systematic strategies can be employed to isolate the cause or causes of a problem; and the resolution generally involves repairing or replacing those components which are at fault.

Intermittent symptoms

Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.

In computer programming, race conditions often lead to intermittent symptoms which are extremely difficult to reproduce; various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to "heating up" a component in a hardware circuit), while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
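As a minimal sketch of the "force synchronization" idea, the example below takes an unsynchronized read-modify-write and inserts a barrier between the read and the write, so that the lost-update symptom occurs on every run instead of rarely. The `Counter` class and function names are illustrative assumptions, not taken from any particular codebase.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0


def unsafe_increment(counter, barrier=None):
    """Increment without locking; optionally stall between read and write."""
    current = counter.value          # read
    if barrier is not None:
        barrier.wait()               # hold here until every thread has read
    counter.value = current + 1      # write (may overwrite another update)


def run(n_threads, force_race):
    """Run n_threads increments and return the final counter value."""
    counter = Counter()
    barrier = threading.Barrier(n_threads) if force_race else None
    threads = [
        threading.Thread(target=unsafe_increment, args=(counter, barrier))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


# With the barrier, all four threads read 0 before any of them writes,
# so the final value is 1 rather than the intended 4.
print(run(4, force_race=True))
```

Once the symptom is reproducible on demand, a candidate fix (such as guarding the read-modify-write with a `threading.Lock`) can be verified instead of merely hoped for.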

One definition of intermittent issues draws a distinction between frequency of occurrence and a "known procedure to consistently reproduce" an issue. For example, knowing that an intermittent problem occurs "within" an hour of a particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an hour, does not constitute a "known procedure" even if the stimulus does increase the frequency of observable exhibitions of the symptom.

Nevertheless, sometimes troubleshooters must resort to statistical methods, and can only find procedures to increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the root cause has been found and that the problem is truly solved.
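A rough way to quantify that confidence, under strong assumptions, is to ask how many symptom-free trials are needed before a run of clean results is unlikely to be luck. The sketch below assumes that trials are independent and that the unfixed fault shows its symptom with a known probability `p_fail` per trial; both assumptions, and the function name, are illustrative rather than drawn from the source. If the fault were still present, the chance of n clean trials in a row would be (1 - p_fail)^n, so n is chosen to push that chance below 1 - confidence.

```python
import math


def trials_for_confidence(p_fail, confidence):
    """Clean trials needed so that a still-present fault would have shown
    itself at least once with probability >= confidence.

    Assumes independent trials and a known per-trial symptom rate p_fail.
    """
    if not (0 < p_fail < 1 and 0 < confidence < 1):
        raise ValueError("p_fail and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


# A symptom that appears in about 1 trial out of 10:
print(trials_for_confidence(0.1, 0.95))   # 29
```

This is why raising the symptom's rate of occurrence matters: at a rate of 1 in 10, about 29 clean trials are needed for 95% assurance, and the required number grows quickly as the symptom becomes rarer.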

Also, tests may be run to stress certain components to determine if those components have failed.

Multiple problems

Isolating single component failures which cause reproducible symptoms is relatively straightforward.

However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will "take it down."

Even in simple systems the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them).

Note that, while we talk about "replacing components", the resolution of many problems involves adjustments or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose contacts", might simply need to be cleaned and/or tightened. All discussion of "replacement" should be taken to mean "replacement or adjustment or other maintenance."
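The way serial substitution with swap-back can miss multiple faults is easy to see in a toy simulation. The model below is an assumption made for illustration: a system works only if every component is good, components are represented as a dict of name to good/bad, and replacement parts are always good.

```python
def system_works(components):
    """Toy model: the system works only if every component is good."""
    return all(components.values())


def substitute_with_swap_back(components):
    """Try a known-good part in each slot, restoring the old part
    whenever the symptom persists."""
    parts = dict(components)
    for name in parts:
        original = parts[name]
        parts[name] = True            # fit a known-good replacement
        if system_works(parts):
            return parts
        parts[name] = original        # symptom persists: swap the old part back
    return parts


def substitute_and_keep(components):
    """Try a known-good part in each slot and leave it fitted."""
    parts = dict(components)
    for name in parts:
        parts[name] = True
        if system_works(parts):
            break
    return parts


# Two independent faults: power supply and main board.
faulty = {"power supply": False, "cable": True, "main board": False}
print(system_works(substitute_with_swap_back(faulty)))  # False
print(system_works(substitute_and_keep(faulty)))        # True
```

With two faults, no single substitution clears the symptom, so the swap-back strategy ends exactly where it started. Leaving known-good parts in place resolves it, at the cost of not learning which of the replaced parts were actually bad.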

See also

  • Bathtub curve
  • Cause and effect
  • Forensic engineering
  • Problem solving
  • Root cause analysis
  • 5 Whys
  • Debugging
  • No Trouble Found
  • RPR Problem Diagnosis

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 