Scratchpad RAM
Scratchpad memory, also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval.

It can be considered similar to the L1 cache in that it is the next closest memory to the ALU after the internal registers, but data is moved to and from main memory by explicit instructions, often using DMA-based data transfer. In contrast with a system that uses caches, a system with scratchpads has Non-Uniform Memory Access latencies, because the memory access latencies to the different scratchpads and the main memory vary. Another difference from a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in the main memory.
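
The explicit movement of data described above can be sketched in C. This is only a host-side model under stated assumptions: the `scratchpad` array stands in for on-chip memory, and `dma_get`, `dma_put` and `SCRATCH_WORDS` are hypothetical names, with the transfers modelled by `memcpy` rather than a real DMA engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_WORDS 64   /* hypothetical scratchpad capacity, in 32-bit words */

/* Stand-in for on-chip scratchpad RAM. On real hardware this would be a
   fixed address range rather than an ordinary array. */
static int32_t scratchpad[SCRATCH_WORDS];

/* Hypothetical DMA helpers, modelled here with memcpy. */
static void dma_get(int32_t *dst_scratch, const int32_t *src_main, size_t words)
{
    memcpy(dst_scratch, src_main, words * sizeof *dst_scratch);
}

static void dma_put(int32_t *dst_main, const int32_t *src_scratch, size_t words)
{
    memcpy(dst_main, src_scratch, words * sizeof *dst_main);
}

/* Multiply every element of a main-memory array by `factor`,
   one scratchpad-sized chunk at a time. */
void scale_in_chunks(int32_t *data, size_t count, int32_t factor)
{
    for (size_t base = 0; base < count; base += SCRATCH_WORDS) {
        size_t n = count - base;
        if (n > SCRATCH_WORDS)
            n = SCRATCH_WORDS;
        assert(n <= SCRATCH_WORDS);

        dma_get(scratchpad, data + base, n);   /* explicit move in */
        for (size_t i = 0; i < n; i++)
            scratchpad[i] *= factor;           /* work on the local copy */
        dma_put(data + base, scratchpad, n);   /* explicit move out */
    }
}
```

Unlike a cache, nothing here happens implicitly: the program decides what is resident in the scratchpad and when results are written back.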

Scratchpads are employed to simplify caching logic, and to guarantee that a unit can work without main memory contention in a system employing multiple processors, especially in a multiprocessor system-on-chip (MPSoC) for embedded systems. They are best suited to storing temporary results (such as would be found in the CPU stack) that typically do not always need to be committed to main memory; however, when fed by DMA, they can also be used in place of a cache for mirroring the state of slower main memory. The same issues of locality of reference apply to efficiency of use, although some systems allow strided DMA to access rectangular data sets. Another difference is that scratchpads are explicitly manipulated by applications.
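
As an illustration of strided DMA on a rectangular data set, the following C sketch gathers a tile of a row-major image into a densely packed scratchpad buffer. It is a software model only: `dma_get_strided`, `sum_tile` and `TILE_MAX` are hypothetical names, and the transfer is modelled with one `memcpy` per row rather than a hardware DMA descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_MAX 256   /* hypothetical scratchpad capacity, in bytes */

/* Stand-in for on-chip scratchpad RAM. */
static uint8_t scratchpad[TILE_MAX];

/* Model of a strided DMA gather: copies `rows` rows of `row_bytes` bytes,
   whose starts lie `stride` bytes apart in main memory, into a packed
   destination buffer. */
void dma_get_strided(uint8_t *dst, const uint8_t *src,
                     size_t row_bytes, size_t rows, size_t stride)
{
    assert(row_bytes <= stride);
    for (size_t r = 0; r < rows; r++)
        memcpy(dst + r * row_bytes, src + r * stride, row_bytes);
}

/* Fetch the w-by-h tile whose top-left corner is (x, y) from an image that
   is image_width bytes wide, then return the sum of its pixels. */
uint32_t sum_tile(const uint8_t *image, size_t image_width,
                  size_t x, size_t y, size_t w, size_t h)
{
    assert(w * h <= TILE_MAX);
    dma_get_strided(scratchpad, image + y * image_width + x,
                    w, h, image_width);

    uint32_t sum = 0;
    for (size_t i = 0; i < w * h; i++)
        sum += scratchpad[i];   /* the tile is contiguous once in the scratchpad */
    return sum;
}
```

Once the tile is packed, the inner loop walks contiguous memory, which is the locality benefit the strided transfer is meant to recover.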

Scratchpads are not used in mainstream desktop processors, where generality is required so that legacy software can run from generation to generation even though the available on-chip memory size may change. They are better suited to embedded systems, special-purpose processors and game consoles, where chips are often manufactured as MPSoCs and where software is often tuned to one hardware configuration.

Examples of use

  • The Cyrix 6x86, the only x86-compatible desktop processor to incorporate a dedicated Scratchpad.

  • SuperH, used in Sega's consoles, could lock cachelines to an address outside of main memory for use as a Scratchpad.

  • The Sony PS1's R3000 had a Scratchpad instead of an L1 cache. It was possible to place the CPU stack here, an example of the temporary workspace usage.

  • Sony's PS2 Emotion Engine employed a 16 KiB Scratchpad, to and from which DMA transfers could be issued to its GS and main memory.

  • The Cell's SPEs are restricted purely to working in their "local-store", relying on DMA for transfers from/to main memory and between local stores, much like a Scratchpad. In this regard, additional benefit is derived from the lack of hardware to check and update coherence between multiple caches: the design takes advantage of the assumption that each processor's workspace is separate and private. It is expected this benefit will become more noticeable as the number of processors scales into the "many-core" future.

  • Many other processors allow L1 cache lines to be locked.

  • Most DSPs use a Scratchpad. Many past 3D accelerators and game consoles (including the PS2) have used DSPs for vertex transformations. This differs from the stream-based approach of modern GPUs, which has more in common with a CPU cache's functions.

  • NVIDIA's 8800 GPU running under CUDA provides 16 KiB of Scratchpad per thread-bundle when being used for GPGPU tasks.

  • Ageia's PhysX chip utilizes Scratchpad RAM in a manner similar to the Cell; its design rationale is that, for physics and collision calculations, a cache hierarchy is of less use than software-managed memory. These memories are also banked and a switch manages transfers between them.

Cache control vs Scratchpads

Many architectures, such as PowerPC, attempt to avoid the need for cacheline locking or scratchpads through the use of cache control instructions. By marking an area of memory with "Data Cache Block: Zero" (allocating a line but setting its contents to zero instead of loading from main memory) and discarding it after use ("Data Cache Block: Invalidate", signaling that main memory need not receive any updated data), the cache is made to behave as a scratchpad. Generality is maintained in that these are hints, and the underlying hardware will function correctly regardless of actual cache size.

Shared L2 vs Cell local stores

Regarding interprocessor communication in a multicore setup, there are similarities between the Cell's inter-localstore DMA and a shared L2 cache setup as in the Intel Core 2 Duo or the Xbox 360's custom PowerPC: the L2 cache allows processors to share results without those results having to be committed to main memory.
This can be an advantage where the working set for an algorithm encompasses the entirety of the L2 cache.
However, when a program is written to take advantage of inter-localstore DMA, the Cell has the benefit that each Local Store serves BOTH as the private workspace for a single processor AND as a point of sharing between processors; i.e., viewed from one processor, the other Local Stores are on a similar footing to the shared L2 cache in a conventional chip. The tradeoff is memory wasted in buffering and added programming complexity for synchronization, though this would be similar to precached pages in a conventional chip.
Domains where using this capability is effective include:
  • Pipeline processing (where one achieves the same effect as increasing the L1 cache's size by splitting one job into smaller chunks).

  • Extending the working set, e.g., a sweet spot for a merge sort where the data fits within 8×256 KiB.

  • Shared code uploading, such as loading a piece of code to one SPU, then copying it from there to the others to avoid hitting main memory again.
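
The pipeline case in the list above can be sketched as a two-stage model in C. This is a sequential host-side illustration, not Cell SDK code: `store_a` and `store_b` stand in for two processors' local stores, and `dma_copy`, `pipeline`, `main_memory_traffic` and `CHUNK` are hypothetical names. On real hardware the two stages would run concurrently on separate processors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16   /* hypothetical chunk size, in 32-bit words */

/* Each array stands in for one processor's private local store. */
static int32_t store_a[CHUNK];
static int32_t store_b[CHUNK];

/* Counts words moved to or from main memory, for illustration only. */
static size_t main_memory_words;

/* Model of a DMA transfer; `touches_main` marks transfers that involve
   main memory rather than going store to store. */
static void dma_copy(int32_t *dst, const int32_t *src, size_t words,
                     int touches_main)
{
    memcpy(dst, src, words * sizeof *dst);
    if (touches_main)
        main_memory_words += words;
}

size_t main_memory_traffic(void)
{
    return main_memory_words;
}

/* Stage A squares each value, stage B adds one. The intermediate result
   moves from store_a to store_b and is never written to main memory. */
void pipeline(int32_t *data, size_t count)
{
    assert(count % CHUNK == 0);
    for (size_t base = 0; base < count; base += CHUNK) {
        dma_copy(store_a, data + base, CHUNK, 1);   /* main -> store A */
        for (size_t i = 0; i < CHUNK; i++)
            store_a[i] *= store_a[i];

        dma_copy(store_b, store_a, CHUNK, 0);       /* store A -> store B */
        for (size_t i = 0; i < CHUNK; i++)
            store_b[i] += 1;

        dma_copy(data + base, store_b, CHUNK, 1);   /* store B -> main */
    }
}
```

Each word crosses the main-memory boundary only twice (once in, once out), however many stages are chained between the local stores.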


It would be possible for a conventional processor to gain similar advantages with cache-control instructions, for example by allowing prefetching to the L1 that bypasses the L2, or an eviction hint that signals a transfer from L1 to L2 without committing to main memory. However, at present no systems offer this capability in a usable form, and such instructions would in effect mirror the explicit transfer of data among the cache areas used by each core.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 