Superscalar

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.

While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are two different performance enhancement techniques. It is theoretically possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar CPU.

The superscalar technique is traditionally associated with several identifying characteristics. Note these are applied within a given CPU core.

  • Instructions are issued from a sequential instruction stream
  • CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
  • Accepts multiple instructions per clock cycle


History

Seymour Cray's CDC 6600 from 1965 is often mentioned as the first superscalar design. The Intel i960CA (1988) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single-chip superscalar microprocessors. RISC CPUs like these brought the superscalar concept to microcomputers because the RISC design results in a simple core, allowing straightforward instruction dispatch and the inclusion of multiple functional units (such as ALUs) on a single CPU in the constrained design rules of the time. This was the reason that RISC designs were faster than CISC designs through the 1980s and into the 1990s.

Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar.

The Pentium was the first superscalar x86 processor. The Nx586, Pentium Pro and AMD K5 were among the first designs to decode x86 instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture. This made dynamic scheduling of buffered partial instructions possible and enabled more parallelism to be extracted than the more rigid methods used in the simpler Pentium allowed. It also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86.

From scalar to superscalar

The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is something of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU, so multiple instructions can process separate data items concurrently.

Superscalar CPU design emphasizes improving the accuracy of the instruction dispatcher and allowing it to keep the multiple functional units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.

A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods.

In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. A superscalar processor can therefore be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.
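
The dispatcher's decision can be illustrated with a small simulation. This is only a minimal sketch under simplifying assumptions that are not taken from any real CPU: every instruction takes one cycle, issue is strictly in order, and a group closes at the first instruction that either reads or writes a register written earlier in the same group or finds no free functional unit of the kind it needs. The names `Instr`, `issue_groups`, `program` and `units` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str    # human-readable form, used only for printing
    unit: str    # kind of functional unit required, e.g. "alu" or "fpu"
    dest: str    # register written by the instruction
    srcs: tuple  # registers read by the instruction


def issue_groups(program, units):
    """Split an in-order instruction stream into per-cycle issue groups.

    `units` maps a unit kind to the number of copies available each cycle.
    """
    groups = []
    current, free, written = [], dict(units), set()
    for ins in program:
        # The instruction cannot join the group if it reads or rewrites a
        # register that an earlier instruction in the same group writes.
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        no_unit = free.get(ins.unit, 0) == 0
        if current and (depends or no_unit):
            groups.append(current)               # close this cycle's group
            current, free, written = [], dict(units), set()
        if free.get(ins.unit, 0) == 0:
            raise ValueError(f"no '{ins.unit}' unit configured")
        current.append(ins)
        free[ins.unit] -= 1
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("a = b + c", "alu", "a", ("b", "c")),
    Instr("d = e + f", "alu", "d", ("e", "f")),
    Instr("g = a + d", "alu", "g", ("a", "d")),
    Instr("x = y * z", "fpu", "x", ("y", "z")),
    Instr("h = g + b", "alu", "h", ("g", "b")),
]
units = {"alu": 2, "fpu": 1}  # assumed mix: two ALUs and one FPU

for cycle, group in enumerate(issue_groups(program, units), start=1):
    print(cycle, [ins.text for ins in group])
# 1 ['a = b + c', 'd = e + f']
# 2 ['g = a + d', 'x = y * z']
# 3 ['h = g + b']
```

With the assumed two ALUs and one FPU, the five instructions issue in three cycles; with a single ALU the same stream needs four, which shows how both dependencies and the number of functional units bound the issue rate.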

Limitations

The performance improvement available from superscalar techniques is limited by two key factors:
  1. The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism, and
  2. The complexity and time cost of the dispatcher and associated dependency checking logic.


Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; d = a + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.
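
The two instruction pairs above can be checked mechanically. The following is a minimal sketch, assuming each instruction is reduced to a hypothetical `(dest, sources)` pair of register names. The labels RAW, WAR and WAW (read after write, write after read, write after write) are the standard textbook names for these hazards and do not appear in the text above.

```python
def hazards(first, second):
    """List the hazards that arise if `second` runs alongside `first`.

    Each instruction is a (dest, sources) pair of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("RAW")  # second reads the value that first produces
    if dest2 in srcs1:
        found.append("WAR")  # second overwrites a value first still reads
    if dest1 == dest2:
        found.append("WAW")  # both instructions write the same register
    return found


# a = b + c; d = e + f  -> independent, may run in parallel
print(hazards(("a", ("b", "c")), ("d", ("e", "f"))))  # []

# a = b + c; d = a + f  -> the second needs the result of the first
print(hazards(("a", ("b", "c")), ("d", ("a", "f"))))  # ['RAW']
```

An empty list means the pair can be dispatched together; any entry means the hardware has to order the two instructions or otherwise resolve the conflict.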

When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be n^k gates, and the delay cost k^2 log n, where n is the number of instructions in the processor's instruction set, and k is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatoric problem involving permutations.

Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results.

No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g., ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small, likely on the order of five to six simultaneously dispatched instructions.

However, even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.
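
One rough way to quantify this second limit is to divide the number of instructions by the length of the longest chain of dependent instructions. The sketch below assumes unit latency for every instruction, unlimited functional units, and that only true dependencies (an instruction reading an earlier result) matter. The function name `intrinsic_parallelism` and the `(dest, sources)` instruction form are hypothetical.

```python
def intrinsic_parallelism(program):
    """Instruction count divided by the longest true-dependency chain.

    `program` is a list of (dest, sources) pairs in program order.
    """
    depth = {}   # register -> chain length of the instruction that last wrote it
    longest = 0
    for dest, srcs in program:
        # An instruction sits one level below its deepest producer.
        level = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = level
        longest = max(longest, level)
    return len(program) / longest if longest else 0.0


independent = [("a", ("b", "c")), ("d", ("e", "f")),
               ("g", ("h", "i")), ("j", ("k", "l"))]
chain = [("a", ("b", "c")), ("d", ("a", "f")),
         ("g", ("d", "h")), ("j", ("g", "k"))]

print(intrinsic_parallelism(independent))  # 4.0
print(intrinsic_parallelism(chain))        # 1.0
```

Under these assumptions the fully dependent chain yields 1.0, so no issue width, however large, would speed it up, while the independent stream could in principle keep four units busy.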

Alternatives

Collectively, these two limits drive investigation into alternative architectural performance increases such as Very Long Instruction Word (VLIW), Explicitly Parallel Instruction Computing (EPIC), simultaneous multithreading (SMT), and multi-core processors.

With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler. Explicitly Parallel Instruction Computing (EPIC) is like VLIW, with extra cache prefetching instructions.

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Superscalar processors differ from multi-core processors in that the redundant functional units are not entire processors. A single processor is composed of finer-grained functional units such as the ALU, integer multiplier, integer shifter, floating point unit, etc. There may be multiple versions of each functional unit to enable execution of many instructions in parallel. This differs from a multicore CPU that concurrently processes instructions from multiple threads, one thread per core. It also differs from a pipelined CPU, where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion.

The various alternative techniques are not mutually exclusive; they can be (and frequently are) combined in a single processor. Thus a multicore CPU is possible where each core is an independent superscalar processor containing multiple parallel pipelines. Some processors also include vector capability.

See also

  • Super-threading
  • Simultaneous multithreading
  • Speculative execution / Eager execution
  • Software lockout, a multiprocessor issue similar to logic dependencies on superscalars

