Superscalar

# Superscalar

Discussion

Encyclopedia
A superscalar CPU
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

architecture implements a form of parallelism called instruction level parallelism
Instruction level parallelism
Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program: 1. e = a + b 2. f = c + d 3. g = e * f...

within a single processor. It therefore allows faster CPU throughput
Throughput
In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node...

than would otherwise be possible at a given clock rate
Clock rate
The clock rate typically refers to the frequency that a CPU is running at.For example, a crystal oscillator frequency reference typically is synonymous with a fixed sinusoidal waveform, a clock rate is that frequency reference translated by electronic circuitry into a corresponding square wave...

. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

, a bit shifter, or a multiplier.

In the Flynn Taxonomy, a superscalar processor is classified as a MIMD processor (Multiple Instructions, Multiple Data).

While a superscalar CPU is typically also pipeline
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

d, pipelining and superscalar architecture are considered different performance enhancement techniques.

The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):
• Instructions are issued from a sequential instruction stream
• CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time
Compile time
In computer science, compile time refers to either the operations performed by a compiler , programming language requirements that must be met by source code for it to be successfully compiled , or properties of the program that can be reasoned about at compile time.The operations performed at...

)
• The CPU accepts multiple instructions per clock cycle

## History

Seymour Cray
Seymour Cray
Seymour Roger Cray was an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research which would build many of these machines. Called "the father of supercomputing," Cray has been credited...

's CDC 6600
CDC 6600
The CDC 6600 was a mainframe computer from Control Data Corporation, first delivered in 1964. It is generally considered to be the first successful supercomputer, outperforming its fastest predecessor, IBM 7030 Stretch, by about three times...

from 1965 is often mentioned as the first superscalar design. The Intel i960
Intel i960
Intel's i960 was a RISC-based microprocessor design that became popular during the early 1990s as an embedded microcontroller, becoming a best-selling CPU in that field, along with the competing AMD 29000...

CA (1988) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single-chip superscalar microprocessors. RISC CPUs like these were first microcomputers to use the superscalar concept, because the RISC design results in a simple core, thereby allowing the inclusion of multiple functional units (such as ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s) on a single CPU in the constrained design rules of the time (this was why RISC designs were faster than CISC
Complex instruction set computer
A complex instruction set computer , is a computer where single instructions can execute several low-level operations and/or are capable of multi-step operations or addressing modes within single instructions...

designs through the 1980s and into the 1990s).

Except for CPUs used in low-power
Low-power
In electronics, the term low-power may mean:* Low-power broadcasting, that the power of the broadcast is less, i.e. the radio waves are not intended to travel as far as from typical transmitters....

applications, embedded system
Embedded system
An embedded system is a computer system designed for specific control functions within a larger system. often with real-time computing constraints. It is embedded as part of a complete device often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal...

s, and battery
Battery (electricity)
An electrical battery is one or more electrochemical cells that convert stored chemical energy into electrical energy. Since the invention of the first battery in 1800 by Alessandro Volta and especially since the technically improved Daniell cell in 1836, batteries have become a common power...

-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar.

The P5
P5 (microarchitecture)
The original Pentium microprocessor was introduced on March 22, 1993. Its microarchitecture, deemed P5, was Intel's fifth-generation and first superscalar x86 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster FPU, wider data bus,...

Pentium was the first superscalar x86 processor; the Nx586, P6
P6 (microarchitecture)
The P6 microarchitecture is the sixth generation Intel x86 microarchitecture, implemented by the Pentium Pro microprocessor that was introduced in November 1995. It is sometimes referred to as i686. It was succeeded by the NetBurst microarchitecture in 2000, but eventually revived in the Pentium M...

Pentium Pro
Pentium Pro
The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel introduced in November 1, 1995 . It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications...

and AMD K5
AMD K5
The K5 was AMD's first x86 processor to be developed entirely in-house. Introduced in March 1996, its primary competition was Intel's Pentium microprocessor. The K5 was an ambitious design, closer to a Pentium Pro than a Pentium regarding technical solutions and internal architecture...

were among the first designs which decode x86-instructions asynchronously into dynamic microcode
Microcode
Microcode is a layer of hardware-level instructions and/or data structures involved in the implementation of higher level machine code instructions in many computers and other processors; it resides in special high-speed memory and translates machine instructions into sequences of detailed...

-like micro-op sequences prior to actual execution on a superscalar microarchitecture
Microarchitecture
In computer engineering, microarchitecture , also called computer organization, is the way a given instruction set architecture is implemented on a processor. A given ISA may be implemented with different microarchitectures. Implementations might vary due to different goals of a given design or...

; this opened up for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to the more rigid methods used in the simpler P5
P5 (microarchitecture)
The original Pentium microprocessor was introduced on March 22, 1993. Its microarchitecture, deemed P5, was Intel's fifth-generation and first superscalar x86 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster FPU, wider data bus,...

Pentium; it also simplified speculative execution
Speculative execution
Speculative execution in computer systems is doing work, the result of which may not be needed. This performance optimization technique is used in pipelined processors and other systems.-Main idea:...

and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86
Cyrix 6x86
The Cyrix 6x86 is a sixth-generation, 32-bit 80x86-compatible microprocessor designed by Cyrix and manufactured by IBM and SGS-Thomson. It was originally released in 1996.-Architecture:...

.

## From scalar to superscalar

The simplest processors are scalar processor
Scalar processor
Scalar processors represent the simplest class of computer processors. A scalar processor processes one datum at a time . , a scalar processor is classified as a SISD processor .In a vector processor, by contrast, a single instruction operates simultaneously on multiple data items...

s. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor
Vector processor
A vector processor, or array processor, is a central processing unit that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items...

operates simultaneously on many data items. An analogy is the difference between scalar
Scalar (mathematics)
In linear algebra, real numbers are called scalars and relate to vectors in a vector space through the operation of scalar multiplication, in which a vector can be multiplied by a number to produce another vector....

and vector arithmetic. A superscalar processor is sort of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU thus multiple instructions can be processing separate data items concurrently.

Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times. This has become increasingly important when the number of units increased. While early superscalar CPUs would have two ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s and a single FPU
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

, a modern design such as the PowerPC 970
PowerPC 970
The PowerPC 970, PowerPC 970FX, PowerPC 970GX, and PowerPC 970MP, are 64-bit Power Architecture processors from IBM introduced in 2002. When used in Apple Inc. machines, they were dubbed the PowerPC G5....

includes four ALUs, two FPUs, and two SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.

A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle
Cycles Per Instruction
In computer architecture, cycles per instruction is a term used to describe one aspect of a processor's performance: the number of clock cycles that happen when an instruction is being executed...

. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

, multiprocessor
Multiprocessor
Computer system having two or more processing units each sharing main memory and peripherals, in order to simultaneously process programs.Sometimes the term Multiprocessor is confused with the term Multiprocessing....

or multi-core
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...

architectures also achieve that, but with different methods.

In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.

## Limitations

Available performance improvement from superscalar techniques is limited by three key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism.
2. The complexity and time cost of the dispatcher and associated dependency checking logic.
3. The branch instruction processing.

Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions `a = b + c; d = e + f` can be run in parallel because none of the results depend on other calculations. However, the instructions `a = b + c; b = e + f` might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.

When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be gates, and the delay cost , where is the number of instructions in the processor's instruction set, and is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatoric problem involving permutation
Permutation
In mathematics, the notion of permutation is used with several slightly different meanings, all related to the act of permuting objects or values. Informally, a permutation of a set of objects is an arrangement of those objects into a particular order...

s.

Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results.

No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g., ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small, likely on the order of five to six simultaneously dispatched instructions.

However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.

## Alternatives

Collectively, these limits drive investigation into alternative architectural changes such as Very Long Instruction Word
Very long instruction word
Very long instruction word or VLIW refers to a CPU architecture designed to take advantage of instruction level parallelism . A processor that executes every instruction one after the other may use processor resources inefficiently, potentially leading to poor performance...

(VLIW), Explicitly Parallel Instruction Computing
Explicitly Parallel Instruction Computing
Explicitly parallel instruction computing is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures...

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading...

(SMT), and multi-core processors
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...

.

With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...

. Explicitly Parallel Instruction Computing
Explicitly Parallel Instruction Computing
Explicitly parallel instruction computing is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures...

(EPIC) is like VLIW, with extra cache prefetching instructions.

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Superscalar processors differ from multi-core processors in that the redundant functional units are not entire processors. A single processor is composed of finer-grained functional units such as the ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

, integer
Integer (computer science)
In computer science, an integer is a datum of integral data type, a data type which represents some finite subset of the mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values....

multiplier, integer shifter, floating point unit
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

, etc. There may be multiple versions of each functional unit to enable execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per core. It also differs from a pipelined CPU, where the multiple instructions can concurrently be in various stages of execution, assembly-line
Assembly line
An assembly line is a manufacturing process in which parts are added to a product in a sequential manner using optimally planned logistics to create a finished product much faster than with handcrafting-type methods...

fashion.

The various alternative techniques are not mutually exclusive—they can be (and frequently are) combined in a single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector
Vector processor
A vector processor, or array processor, is a central processing unit that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items...

capability.

Super-threading is a type of multithreading that enables different threads to be executed by a single processor without truly executing them at the same time. This qualifies it as time-sliced or temporal multithreading rather than simultaneous multithreading...

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading...

(SMT)
• Speculative execution
Speculative execution
Speculative execution in computer systems is doing work, the result of which may not be needed. This performance optimization technique is used in pipelined processors and other systems.-Main idea:...

/ Eager execution
• Software lockout
Software lockout
In multiprocessor computer systems, software lockout is the issue of performance degradation due to the idle wait times spent by the CPUs in kernel-level critical sections. Software lockout is the major cause of scalability degradation in a multiprocessor system, posing a limit on the maximum...

, a multiprocessor issue similar to logic dependencies on superscalars