All Topics  
Out-of-order execution

 

   Email Print
   Bookmark   Link






 

Out-of-order execution



 
 
In computer engineering
Computer engineering

Computer Engineering is a discipline that combines elements of both Electrical Engineering and Computer Science. Computer engineers are electrical engineers that have additional training in the areas of software design and hardware-software integration....
, out-of-order execution, OoOE, is a paradigm used in most high-performance microprocessor
Microprocessor

A microprocessor incorporates most or all of the functions of a central processing unit on a single integrated circuit . The first microprocessors emerged in the early 1970s and were used for electronic calculators, using Binary-coded decimal arithmetic on 4-bit Word ....
s to make use of cycles
Instruction cycle

An instruction cycle is the time period during which a computer processes a machine language instruction from its computer storage or the sequence of actions that the central processing unit performs to execute each machine code instruction in a program....
 that would otherwise be wasted by a certain type of costly delay. Most modern CPU designs include support for out of order execution.

History
Out-of-order execution is a restricted form of data flow
Dataflow architecture

Dataflow architecture is a computer architecture that directly contrasts the traditional von Neumann architecture or control flow architecture. Dataflow architectures do not have a program counter or the executability and execution of instructions is solely determined based on the availability of input arguments to the instructions....
 computation, which was a major research area in computer architecture
Computer architecture

Computer architecture in computer engineering is the conceptual design and fundamental operational structure of a computer system. It is a blueprint and functional description of requirements and design implementations for the various parts of a computer, focusing largely on the way by which the central processing unit performs internally an...
 in the 1970s and early 1980s.






Discussion
Ask a question about 'Out-of-order execution'
Start a new discussion about 'Out-of-order execution'
Answer questions from other users
Full Discussion Forum



Encyclopedia


In computer engineering
Computer engineering

Computer Engineering is a discipline that combines elements of both Electrical Engineering and Computer Science. Computer engineers are electrical engineers that have additional training in the areas of software design and hardware-software integration....
, out-of-order execution, OoOE, is a paradigm used in most high-performance microprocessor
Microprocessor

A microprocessor incorporates most or all of the functions of a central processing unit on a single integrated circuit . The first microprocessors emerged in the early 1970s and were used for electronic calculators, using Binary-coded decimal arithmetic on 4-bit Word ....
s to make use of cycles
Instruction cycle

An instruction cycle is the time period during which a computer processes a machine language instruction from its computer storage or the sequence of actions that the central processing unit performs to execute each machine code instruction in a program....
 that would otherwise be wasted by a certain type of costly delay. Most modern CPU designs include support for out of order execution.

History


Out-of-order execution is a restricted form of data flow
Dataflow architecture

Dataflow architecture is a computer architecture that directly contrasts the traditional von Neumann architecture or control flow architecture. Dataflow architectures do not have a program counter or the executability and execution of instructions is solely determined based on the availability of input arguments to the instructions....
 computation, which was a major research area in computer architecture
Computer architecture

Computer architecture in computer engineering is the conceptual design and fundamental operational structure of a computer system. It is a blueprint and functional description of requirements and design implementations for the various parts of a computer, focusing largely on the way by which the central processing unit performs internally an...
 in the 1970s and early 1980s. Important academic research in this subject was led by Yale Patt
Yale Patt

Yale Nance Patt is an United States professor of electrical engineering and computer engineering at University of Texas at Austin. He holds the Ernest Cockrell, Jr....
 and his HPSm simulator. A paper by James E. Smith and A.R. Pleszkun, published in 1985 completed the scheme by describing how the precise behavior of exceptions could be maintained in out-of-order machines.

The first machine to use out-of-order execution was probably the CDC 6600
CDC 6600

The CDC 6600 was a mainframe computer from Control Data Corporation, first delivered in 1964. It is generally considered to be the first successful supercomputer, outperforming its fastest predecessor, IBM 7030 Stretch, by about three times....
 (1964), which used a scoreboard
Scoreboarding

Scoreboarding is a centralized method, used in the CDC 6600, for dynamically scheduling a Pipeline so that the instructions can Out-of-order execution when there are no conflicts and the hardware is available....
 to resolve conflicts. About three years later, the IBM 360/91 (1966) introduced Tomasulo's algorithm
Tomasulo algorithm

The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially ....
. In 1990, IBM introduced the first out-of-order microprocessor, the POWER1
POWER1

The POWER1 is a Integrated circuit Central processing unit developed and Semiconductor device fabrication by IBM that implemented the IBM POWER instruction set ....
, although out-of-order execution was limited to floating point instructions only.

Throughout the 1990s out-of-order execution became more common, and was featured in the IBM
IBM

International Business Machines Corporation, abbreviated IBM and nicknamed "Big Blue" , is a multinational corporation computer technology and consulting corporation headquartered in Armonk, New York, New York, United States....
/Motorola
Motorola

Motorola, Inc. is an United States, multinational, Fortune 100, telecommunications company based in Schaumburg, Illinois. It is a manufacturer of wireless telephone handsets, also designing and selling wireless network infrastructure equipment such as cellular transmission base stations and signal amplifiers....
 PowerPC
PowerPC

PowerPC is a RISC instruction set architecture created by the 1991 Apple Inc.?IBM?Motorola alliance, known as AIM alliance. Originally intended for personal computers, PowerPC CPUs have since become popular embedded system and high-performance processors....
 601 (1992/1993), Fujitsu
Fujitsu

is a Japanese company specializing in semiconductors, air conditioners, computers , telecommunications, and Service , and is headquartered in Minato, Tokyo, Tokyo....
/HAL
HAL

HAL or Hal may refer to:...
 SPARC64 (1995), Intel Pentium Pro
Pentium Pro

The Pentium Pro is a sixth-generation x86-based microprocessor developed and manufactured by Intel introduced in November 1995. It introduced the Intel P6 and was originally intended to replace the original Pentium in a full range of applications....
 (1995), MIPS
MIPS architecture

MIPS is a RISC instruction set architecture developed by MIPS Technologies . In the mid to late 1990s, it was estimated that one in three RISC microprocessors produced were MIPS implementations....
 R10000
R10000

The R10000, code-named "T5", is a microprocessor implementation of the MIPS architecture instruction set architecture developed by MIPS Technologies , then a division of Silicon Graphics ....
 (1996), the HP PA-8000 (1996), K5
AMD K5

The K5 was Advanced Micro Devices first X86 architecture processor to be developed entirely in-house. Introduced in March 1996, its primary competition was Intel Corporation Pentium microprocessor....
 (1996) and the Alpha 21264
Alpha 21264

The Alpha 21264 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the DEC Alpha instruction set architecture ....
 (1998). Notable exceptions to this trend include Sun's
Sun Microsystems

Sun Microsystems, Inc. is a multinational corporation vendor of computers, computer components, computer software, and information technology services, founded on February 24, 1982....
 UltraSPARC
SPARC

SPARC is a Reduced Instruction Set Computer microprocessor instruction set Computer architecture originally designed in 1985 by Sun Microsystems....
, HP/Intel's IA-64, Transmeta's
Transmeta

Transmeta Corporation was a United States-based corporation that licensed low power semiconductor intellectual property. Transmeta originally produced very long instruction word code morphing microprocessors, with a focus on reducing power consumption in electronic devices....
 Crusoe
Crusoe

Crusoe may refer to:* Crusoe , a 2008 television series based on the novel Robinson Crusoe* Crusoe , a 1989 film by Caleb Deschanel, based on the novel...
 and most recently, IBM's
IBM

International Business Machines Corporation, abbreviated IBM and nicknamed "Big Blue" , is a multinational corporation computer technology and consulting corporation headquartered in Armonk, New York, New York, United States....
 POWER6
POWER6

The POWER6 microprocessor is IBM's follow-on to the POWER5. It is part of the eCLipz, said to have a goal of converging IBM's server hardware where practical ....
 and Intel's Atom
Intel Atom

Intel Atom is the brand name for a line of x86 and x86-64 CPUs from Intel, previously List of Intel codenames Silverthorne and Diamondville processors, designed for a 45 nm CMOS process and intended for use in MIDs, smart phones and ultra-mobile PCs meant for portable and low-power applications....
.

The logical complexity of the out-of-order schemes was the reason that this technique did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm due to large silicon area that is required to build this class of machine.

Basic concept


In-order processors

In earlier processors, the processing of instructions is normally done in these steps:
  1. Instruction
    Instruction (computer science)

    In computer science, an instruction is a single operation of a central processing unit defined by an instruction set architecture. In a broader sense, an "instruction" may be any representation of an element of an executable program, such as a bytecode....
     fetch
    Fetch

    Fetch may refer to:* Fetch , the length of water over which a given wind has blown* Fetch , a game played between a human and a pet in which the human throws an object for the pet to catch and/or retrieve...
    .
  2. If input operand
    Operand

    An operand is one of the inputs of an operator in mathematics. The following arithmetic expression shows an example of operators and operands:...
    s are available (in registers for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands is unavailable during the current clock cycle (generally because they are being fetched from memory
    Computer memory

    Computer memory is usually meant to refer to the semiconductor technology that is used to store information in Electronics devices. Current primary computer memory makes use of integrated circuits consisting of silicon-based transistors....
    ), the processor stalls until they are available.
  3. The instruction is executed by the appropriate functional unit.
  4. The functional unit writes the results back to the register file
    Register file

    A register file is an array of processor registers in a central processing unit. Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports....
    .


Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:
  1. Instruction fetch.
  2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations
    Reservation stations

    Reservation Stations are decentralized features of the microarchitecture of a Central processing unit that allow for register renaming, and are used by the Tomasulo algorithm for dynamic instruction scheduling....
    ).
  3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
  4. The instruction is issued to the appropriate functional unit and executed by that unit.
  5. The results are queued.
  6. Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.


The key concept of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoO processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order, in the processor they are handled in data order, the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline
Instruction pipeline

File:5 Stage Pipeline.svgAn instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....
 deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues which allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture
Decoupled architecture

In computer science a decoupled architecture is a processor with out-of-order execution that separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer....
. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.

To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming
Register renaming

In computer engineering, register renaming refers to a technique usedto avoid unnecessary serialization of program operations imposed by the reuse...
 is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.

The ability to issue instructions past branches which have yet to resolve is known as speculative execution
Speculative execution

In computer science, speculative execution is the execution of Code , the result of which may not be needed. In the context of functional programming, the term "speculative evaluation" is used instead....
.

Micro-architectural choices

  • Are the instructions dispatched to a centralized queue or to multiple distributed queues?


IBM
IBM

International Business Machines Corporation, abbreviated IBM and nicknamed "Big Blue" , is a multinational corporation computer technology and consulting corporation headquartered in Armonk, New York, New York, United States....
 PowerPC
PowerPC

PowerPC is a RISC instruction set architecture created by the 1991 Apple Inc.?IBM?Motorola alliance, known as AIM alliance. Originally intended for personal computers, PowerPC CPUs have since become popular embedded system and high-performance processors....
 processors use queues which are distributed among the different functional units while other Out-of-Order processors use a centralized queue. IBM uses the term reservation stations for their distributed queues.


  • Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps which hold the register renaming information for each instruction in flight.


Early Intel out-of-order processors use a results queue called a re-order buffer
Re-order buffer

A re-order buffer is used in a Tomasulo algorithm for out-of-order execution instruction execution. It allows instructions to be committed in-order....
, while most later Out-of-Order processors use register maps.


See also

  • Scoreboarding
    Scoreboarding

    Scoreboarding is a centralized method, used in the CDC 6600, for dynamically scheduling a Pipeline so that the instructions can Out-of-order execution when there are no conflicts and the hardware is available....
  • Tomasulo algorithm
    Tomasulo algorithm

    The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially ....
  • Replay system
    Replay system

    The Replay system is a little known subsystem within the Intel Pentium 4 processor. Its primary function is to catch operations that have been mistakenly sent for execution by the processor's Instruction scheduling....


External links

  • , a paper by Smith and Pleszkun (1985).