Memory scrubbing
Memory scrubbing is the process of detecting and correcting bit errors in computer memory by using an error-correcting code (ECC).

Motivation for scrubbing

Due to the high integration density of contemporary computer memory chips, the individual memory cell structures have become small enough to be vulnerable to cosmic rays and/or alpha particle emission. The errors caused by these phenomena are called soft errors. This can be a problem for DRAM- and SRAM-based memories.

The probability of a soft error at any individual memory bit is very small. However, given the large amount of memory with which computers, especially servers, are equipped nowadays, and given uptimes of several months, the probability of soft errors somewhere in the total memory installed is significant.
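
The scaling argument above can be made concrete with a small calculation. The sketch below assumes that soft errors occur independently at a constant rate per bit (a Poisson model). The function name and the per-bit rate are hypothetical and chosen only for illustration; the rate is not a measured value.

```python
import math

def prob_any_soft_error(rate_per_bit_hour, bits, hours):
    """Probability of at least one soft error anywhere in the memory.

    Assumes independent errors at a constant per-bit rate (Poisson model).
    """
    expected_errors = rate_per_bit_hour * bits * hours
    # -expm1(-x) evaluates 1 - exp(-x) accurately even for very small x
    return -math.expm1(-expected_errors)

RATE = 1e-15                      # hypothetical: errors per bit per hour
SERVER_BITS = 64 * 2**30 * 8      # assumed server with 64 GiB of memory
UPTIME_HOURS = 24 * 90            # assumed uptime of about three months

one_bit = prob_any_soft_error(RATE, 1, 1.0)
whole_server = prob_any_soft_error(RATE, SERVER_BITS, UPTIME_HOURS)
```

Under these assumed numbers, `one_bit` is negligible while `whole_server` comes out at roughly 0.7, which illustrates how a tiny per-bit probability becomes significant across a large memory and a long uptime.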

ECC support for scrubbing

The information in an ECC memory is stored redundantly enough to correct a single-bit error per memory word. Hence, an ECC memory can support the scrubbing of the memory content: if the memory controller scans systematically through the memory, single-bit errors can be detected, the erroneous bit can be located using the ECC checksum, and the corrected data can be written back to the memory.
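
The detect, locate, and write-back cycle can be sketched in software. The example below is a toy model only: it uses a Hamming(7,4) code on 4-bit values and a Python list standing in for memory, whereas real ECC modules implement wider codes in hardware. All function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # parity over positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(word):
    """Extract the 4 data bits from a codeword (positions 3, 5, 6, 7)."""
    return (((word >> 2) & 1) | (((word >> 4) & 1) << 1)
            | (((word >> 5) & 1) << 2) | (((word >> 6) & 1) << 3))

def syndrome(word):
    """XOR of the positions of all set bits: 0 if valid, else the flipped position."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s

def scrub(memory):
    """One scrub pass: read every word, fix single-bit errors, write back.

    Returns the number of words that were corrected.
    """
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[addr] = word ^ (1 << (s - 1))  # flip the erroneous bit back
            corrected += 1
    return corrected
```

For instance, after filling `memory` with `hamming74_encode(n)` for each `n` in `range(16)` and flipping one bit in two different words, a call to `scrub(memory)` repairs both words in place. A word with two flipped bits would be miscorrected by this toy code, which is why scrubbing must run before errors accumulate.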

Scrubbing in more detail

It is important to check each memory location periodically, and frequently enough that multiple bit errors within the same word remain unlikely: single-bit errors can be corrected, but multiple-bit errors are not correctable by typical (as of 2008) ECC memory modules.
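
The effect of the scrub interval on the risk of an uncorrectable error can be estimated with a short calculation. This is a sketch under stated assumptions: errors are independent with a constant per-bit rate (a Poisson model), every scrub leaves the word error-free, and the rate and word width below are hypothetical values picked for readability.

```python
import math

def prob_multi_bit_error(rate_per_bit_hour, word_bits, scrub_interval_hours):
    """Probability that one word collects two or more soft errors between scrubs."""
    m = rate_per_bit_hour * word_bits * scrub_interval_hours  # expected errors
    # P(N >= 2) = 1 - P(0) - P(1) = 1 - exp(-m) - m * exp(-m)
    return -math.expm1(-m) - m * math.exp(-m)

RATE = 1e-6       # hypothetical and exaggerated: errors per bit per hour
WORD_BITS = 72    # assumed word: 64 data bits plus 8 check bits

daily = prob_multi_bit_error(RATE, WORD_BITS, 24.0)
twice_daily = prob_multi_bit_error(RATE, WORD_BITS, 12.0)
ratio = daily / twice_daily
```

For small expected error counts the probability is close to m^2/2, so halving the scrub interval reduces the per-interval risk of a multiple-bit error by a factor of about four (`ratio` is close to 4 here).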

In order not to disturb regular memory requests from the CPU, and thus to avoid degrading performance, scrubbing is usually done only during idle periods. As scrubbing consists of normal read and write operations, it may increase the power consumption of the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program.

The normal memory reads issued by the CPU or DMA devices are also checked for ECC errors, but due to data locality they can be confined to a small range of addresses, leaving other memory locations untouched for a very long time. These locations can become vulnerable to more than one soft error, whereas scrubbing ensures that the whole memory is checked within a guaranteed time.

On some systems, not only the main memory (DRAM-based) is capable of scrubbing but also the CPU caches (SRAM-based). On most systems the scrubbing rates for both can be set independently. Because the cache is much smaller than the main memory, the scrubbing for caches does not need to happen as frequently.

Memory scrubbing increases reliability; it can therefore be classified as a RAS (reliability, availability, and serviceability) feature.

See also

  • Data scrubbing, a general category containing memory scrubbing
  • Soft error, an important reason for doing memory scrubbing
  • Error detection and correction, a general theory used for memory scrubbing

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 