Out-of-order execution
Encyclopedia
In computer engineering
Computer engineering
Computer engineering, also called computer systems engineering, is a discipline that integrates several fields of electrical engineering and computer science required to develop computer systems. Computer engineers usually have training in electronic engineering, software design, and...

, out-of-order execution (OoOE or OOE) is a paradigm
Paradigm
The word paradigm has been used in science to describe distinct concepts. It comes from Greek "παράδειγμα" , "pattern, example, sample" from the verb "παραδείκνυμι" , "exhibit, represent, expose" and that from "παρά" , "beside, beyond" + "δείκνυμι" , "to show, to point out".The original Greek...

 used in most high-performance microprocessor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...

s to make use of instruction cycle
Instruction cycle
An instruction cycle is the basic operation cycle of a computer. It is the process by which a computer retrieves a program instruction from its memory, determines what actions the instruction requires, and carries out those actions...

s that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, processing instead the next instructions which is able to run immediately.

History

Out-of-order execution is a restricted form of data flow
Dataflow architecture
Dataflow architecture is a computer architecture that directly contrasts the traditional von Neumann architecture or control flow architecture. Dataflow architectures do not have a program counter, or the executability and execution of instructions is solely determined based on the availability of...

 computation, which was a major research area in computer architecture
Computer architecture
In computer science and engineering, computer architecture is the practical art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals and the formal modelling of those systems....

 in the 1970s and early 1980s. Important academic research in this subject was led by Yale Patt
Yale Patt
Yale Nance Patt is an American professor of electrical and computer engineering at The University of Texas at Austin. He holds the Ernest Cockrell, Jr. Centennial Chair in Engineering. In 1965, Patt introduced the WOS module, the first complex logic gate implemented on a single piece of silicon...

 and his HPSm simulator. A paper by James E. Smith
James E. Smith
James E. Smith was Comptroller of the Currency of the United States from 1973 to 1976.James E. Smith was Deputy Under-Secretary of the Treasury before being named Comptroller by President Nixon...

 and A.R. Pleszkun, published in 1985 completed the scheme by describing how the precise behavior of exceptions could be maintained in out-of-order machines.

Arguably the first machine to use out-of-order execution was probably the CDC 6600
CDC 6600
The CDC 6600 was a mainframe computer from Control Data Corporation, first delivered in 1964. It is generally considered to be the first successful supercomputer, outperforming its fastest predecessor, IBM 7030 Stretch, by about three times...

 (1964), which used a scoreboard
Scoreboarding
Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged...

 to resolve conflicts. In modern usage, such scoreboarding is considered to be in-order execution, not out-of-order execution, since such machines stall on the first RAW (Read After Write) conflict. Strictly speaking, such machines initiate execution in-order, although they may complete execution out-of-order.

About three years later, the IBM 360/91 (1966) introduced Tomasulo's algorithm
Tomasulo algorithm
The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially...

, supporting full out-of-order execution.

In 1990, IBM introduced the first out-of-order microprocessor, the POWER1
POWER1
The POWER1 is a multi-chip CPU developed and fabricated by IBM that implemented the POWER instruction set architecture . It was originally known as the “RISC System/6000 CPU” or when an abbreviated form, the “RS/6000 CPU” before introduction of successors required the original name to be replaced...

, although out-of-order execution was limited to floating point instructions only.

Throughout the 1990s out-of-order execution became more common, and was featured in the IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

/Motorola
Motorola
Motorola, Inc. was an American multinational telecommunications company based in Schaumburg, Illinois, which was eventually divided into two independent public companies, Motorola Mobility and Motorola Solutions on January 4, 2011, after losing $4.3 billion from 2007 to 2009...

 PowerPC 601 (1993), Fujitsu
Fujitsu
is a Japanese multinational information technology equipment and services company headquartered in Tokyo, Japan. It is the world's third-largest IT services provider measured by revenues....

/HAL
HAL Computer Systems
HAL Computer Systems, Inc was a Campbell, California-based computer manufacturer founded in 1990 by Andrew Heller, a principal designer of the original IBM POWER architecture...

 SPARC64
SPARC64
SPARC64 is a microprocessor developed by HAL Computer Systems and fabricated by Fujitsu. It implements the SPARC V9 instruction set architecture , the first microprocessor to do so. SPARC64 was HAL's first microprocessor and was the first in the SPARC64 brand. It operates at 101 and 118 MHz...

 (1995), Intel Pentium Pro
Pentium Pro
The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel introduced in November 1, 1995 . It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications...

 (1995), MIPS
MIPS Technologies
MIPS Technologies, Inc. , formerly MIPS Computer Systems, Inc., is most widely known for developing the MIPS architecture and a series of pioneering RISC chips. MIPS provides processor architectures and cores for digital home, networking and mobile applications.MIPS Computer Systems Inc. was...

 R10000
R10000
The R10000, code-named "T5", is a RISC microprocessor implementation of the MIPS IV instruction set architecture developed by MIPS Technologies, Inc. , then a division of Silicon Graphics, Inc. . The chief designers were Chris Rowen and Kenneth C. Yeager...

 (1996), HP PA-8000
PA-8000
The PA-8000 , code-named Onyx, is a microprocessor developed and fabricated by Hewlett-Packard that implemented the PA-RISC 2.0 instruction set architecture . It was a completely new design with no circuitry derived from previous PA-RISC microprocessors...

 (1996), AMD K5
AMD K5
The K5 was AMD's first x86 processor to be developed entirely in-house. Introduced in March 1996, its primary competition was Intel's Pentium microprocessor. The K5 was an ambitious design, closer to a Pentium Pro than a Pentium regarding technical solutions and internal architecture...

 (1996) and DEC
Digital Equipment Corporation
Digital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...

 Alpha 21264
Alpha 21264
The Alpha 21264 was a Digital Equipment Corporation RISC microprocessor introduced in October, 1996. The 21264 implemented the Alpha instruction set architecture .- Description :...

 (1998). Notable exceptions to this trend include the Sun
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...

 UltraSPARC
UltraSPARC
The UltraSPARC is a microprocessor developed by Sun Microsystems who is now a part of Oracle Corporation and fabricated by Texas Instruments that implements the SPARC V9 instruction set architecture . It was introduced in mid-1995. It was the first microprocessor from Sun Microsystems to implement...

, HP/Intel Itanium
Itanium
Itanium is a family of 64-bit Intel microprocessors that implement the Intel Itanium architecture . Intel markets the processors for enterprise servers and high-performance computing systems...

, Transmeta Crusoe
Transmeta Crusoe
The Crusoe is a family of x86-compatible microprocessors developed by Transmeta. Crusoe was notable for its method of achieving x86 compatibility. Instead of the instruction set architecture being implemented in hardware, or translated by specialized hardware, the Crusoe runs a software abstraction...

, Intel Atom
Intel Atom
Intel Atom is the brand name for a line of ultra-low-voltage x86 and x86-64 CPUs from Intel, designed in 45 nm CMOS and used mainly in netbooks, nettops, embedded application ranging from health care to advanced robotics and Mobile Internet devices...

, and the IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

 POWER6
POWER6
The POWER6 is a microprocessor developed by IBM that implemented the Power ISA v.2.03. When it became available in systems in 2007, it succeeded the POWER5+ as IBM's flagship Power microprocessor...

.

The logical complexity of the out-of-order schemes was the reason that this technique did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm due to large silicon area that is required to build this class of machine. Low power usage is another design goal that's harder to achieve with an OoOE design.

In-order processors

In earlier processors, the processing of instructions is normally done in these steps:
  1. Instruction fetch.
  2. If input operand
    Operand
    In mathematics, an operand is the object of a mathematical operation, a quantity on which an operation is performed.-Example :The following arithmetic expression shows an example of operators and operands:3 + 6 = 9\;...

    s are available (in registers for instance), the instruction is dispatched to the appropriate functional unit. If one or more operand is unavailable during the current clock cycle (generally because they are being fetched from memory
    Computer memory
    In computing, memory refers to the physical devices used to store programs or data on a temporary or permanent basis for use in a computer or other digital electronic device. The term primary memory is used for the information in physical systems which are fast In computing, memory refers to the...

    ), the processor stalls until they are available.
  3. The instruction is executed by the appropriate functional unit.
  4. The functional unit writes the results back to the register file
    Register file
    A register file is an array of processor registers in a central processing unit . Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports...

    .

Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:
  1. Instruction fetch.
  2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations
    Reservation stations
    Reservation Stations are decentralized features of the microarchitecture of a CPU that allow for register renaming, and are used by the Tomasulo algorithm for dynamic instruction scheduling....

    ).
  3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
  4. The instruction is issued to the appropriate functional unit and executed by that unit.
  5. The results are queued.
  6. Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.


The key concept of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoO processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order, in the processor they are handled in data order, the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

 deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues which allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture
Decoupled architecture
In computer science, a decoupled architecture is a processor with out-of-order execution that separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in a computer program...

. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.

To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming
Register renaming
In computer architecture, register renaming refers to a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations.-Problem definition:...

 is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.

The ability to issue instructions past branches which have yet to resolve is known as speculative execution
Speculative execution
Speculative execution in computer systems is doing work, the result of which may not be needed. This performance optimization technique is used in pipelined processors and other systems.-Main idea:...

.

Micro-architectural choices

  • Are the instructions dispatched to a centralized queue or to multiple distributed queues?

IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

 PowerPC
PowerPC
PowerPC is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM...

 processors use queues which are distributed among the different functional units while other Out-of-Order processors use a centralized queue. IBM uses the term reservation stations for their distributed queues.

  • Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps which hold the register renaming information for each instruction in flight.

Early Intel out-of-order processors use a results queue called a re-order buffer
Re-order buffer
A re-order buffer is used in a Tomasulo algorithm for out-of-order instruction execution. It allows instructions to be committed in-order....

, while most later Out-of-Order processors use register maps.

More precisely: Intel P6
P6 (microarchitecture)
The P6 microarchitecture is the sixth generation Intel x86 microarchitecture, implemented by the Pentium Pro microprocessor that was introduced in November 1995. It is sometimes referred to as i686. It was succeeded by the NetBurst microarchitecture in 2000, but eventually revived in the Pentium M...

 family microprocessors have both a ROB re-order buffer
Re-order buffer
A re-order buffer is used in a Tomasulo algorithm for out-of-order instruction execution. It allows instructions to be committed in-order....

 and a RAT register map mechanism. The ROB was motivated mainly by branch misprediction recovery.

The Intel P6
P6 (microarchitecture)
The P6 microarchitecture is the sixth generation Intel x86 microarchitecture, implemented by the Pentium Pro microprocessor that was introduced in November 1995. It is sometimes referred to as i686. It was succeeded by the NetBurst microarchitecture in 2000, but eventually revived in the Pentium M...

 family was among the earliest OoO processors, but was supplanted by the Intel Pentium 4 Willamette microarchitecture. Years later it proved to be a dead end due to performance and thermal issues, forcing Intel to go back to the P6 design and continue from there. The result were the Core and Core 2 series of processors and most notably its latest iteration, Sandy Bridge
Sandy Bridge
Sandy Bridge is the codename for a microarchitecture developed by Intel beginning in 2005 for central processing units in computers to replace the Nehalem microarchitecture...

.

See also

  • Scoreboarding
    Scoreboarding
    Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged...

  • Tomasulo algorithm
    Tomasulo algorithm
    The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially...

  • Replay system
    Replay system
    The Replay system is a little known subsystem within the Intel Pentium 4 processor. Its primary function is to catch operations that have been mistakenly sent for execution by the processor's scheduler...

  • Dataflow architecture
    Dataflow architecture
    Dataflow architecture is a computer architecture that directly contrasts the traditional von Neumann architecture or control flow architecture. Dataflow architectures do not have a program counter, or the executability and execution of instructions is solely determined based on the availability of...


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK