Pipelining redirects here. For HTTP pipelining, see HTTP pipelining.
[Figure: Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in MEM stage, and the latest instruction has not yet entered the pipeline.]
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe).
The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project. The IBM Stretch project proposed the terms "Fetch, Decode, and Execute", which became common usage.
Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip flops). When the clock signal arrives, the flip flops take their new value and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage.
Instruction fetch
Instruction decode and register fetch
Execute
Memory access
Register write back
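As a rough illustration of why inserting flip flops between pieces of logic shortens the clock period, the sketch below compares one long logic path with the same logic split into the five stages listed above. The stage delays and the flip-flop overhead are invented numbers chosen for the example, not measurements of any real processor.

```python
# Hypothetical logic delay of each stage, in picoseconds (illustrative values only).
STAGE_DELAY_PS = {"IF": 200, "ID": 150, "EX": 250, "MEM": 300, "WB": 100}

# Assumed extra delay contributed by the flip flops at the end of each stage.
FLIP_FLOP_OVERHEAD_PS = 20


def unpipelined_period(delays):
    """Clock period when all of the logic must settle within a single cycle."""
    return sum(delays.values())


def pipelined_period(delays, overhead):
    """Clock period set by the slowest stage plus the flip-flop overhead."""
    return max(delays.values()) + overhead


def pipelined_latency(delays, overhead):
    """Time for one instruction to pass through every stage of the pipeline."""
    return len(delays) * pipelined_period(delays, overhead)


if __name__ == "__main__":
    print("unpipelined period:", unpipelined_period(STAGE_DELAY_PS), "ps")
    print("pipelined period:  ", pipelined_period(STAGE_DELAY_PS, FLIP_FLOP_OVERHEAD_PS), "ps")
    print("pipelined latency: ", pipelined_latency(STAGE_DELAY_PS, FLIP_FLOP_OVERHEAD_PS), "ps")
```

With these made-up numbers the clock period drops from 1000 ps to 320 ps, while a single instruction needs 1600 ps to pass through all five stages.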
Hazards: When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards such as forwarding and stalling exist.
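One common kind of hazard arises when an instruction reads a register that an earlier instruction, still in the pipeline, has not yet written. The following is a minimal sketch of how such read-after-write dependencies could be spotted in a short instruction list. The `Instr` record, the `raw_hazards` function and the `distance` parameter (how many following instructions overlap with the producer) are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str              # assembly text, used for display only
    dest: Optional[str]    # register written by the instruction, or None
    srcs: Tuple[str, ...]  # registers read by the instruction


def raw_hazards(program: List[Instr], distance: int) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for each
    read-after-write dependency within `distance` instructions."""
    found = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + distance, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                found.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # register overwritten; later readers depend on the newer value
    return found


PROGRAM = [
    Instr("LOAD A, R1", "R1", ()),
    Instr("LOAD B, R2", "R2", ()),
    Instr("ADD R1, R2, R3", "R3", ("R1", "R2")),
    Instr("STORE R3, C", None, ("R3",)),
]

if __name__ == "__main__":
    for i, j, reg in raw_hazards(PROGRAM, distance=2):
        print(f"'{PROGRAM[j].text}' needs {reg} from '{PROGRAM[i].text}'")
```

Whether a detected dependency actually causes incorrect behavior depends on the pipeline depth and on whether forwarding is available; the sketch only lists the candidates.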
A non-pipeline architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU but making those modules work in parallel improves program execution significantly.
Pipelined processors are organized internally into stages that can work semi-independently on separate jobs. The stages are linked into a 'chain', so that each stage's output is fed to the next stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
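The effect of stalls on the ideal speedup can be estimated with a little arithmetic. The sketch below assumes, as the paragraph above does, that the clock frequency scales with the number of stages, so an unpipelined core is modelled as spending one (short) cycle per stage on each instruction. The function names and the example figures are illustrative assumptions.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction occupies the whole machine for n_stages cycles."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    """n_stages cycles to fill the pipeline, then one instruction completes
    per cycle, plus any wasted (stall) cycles."""
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, n_stages, stall_cycles=0):
    """Ratio of unpipelined to pipelined cycle counts."""
    return (unpipelined_cycles(n_instructions, n_stages)
            / pipelined_cycles(n_instructions, n_stages, stall_cycles))


if __name__ == "__main__":
    # 1000 instructions on a 5-stage pipeline, first ideal, then with 500 stalls.
    print(round(speedup(1000, 5), 2))
    print(round(speedup(1000, 5, stall_cycles=500), 2))
```

In the ideal case the ratio approaches the number of stages as the instruction count grows; every stall cycle pulls it further below that limit.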
Advantages and Disadvantages
Pipelining does not help in all cases. There are several possible disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining:
The cycle time of the processor is reduced, thus increasing instruction issue-rate in most cases.
Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry compared with a more complex combinational circuit.
Disadvantages of Pipelining:
A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.
A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
Examples
Generic pipeline
To the right is a generic pipeline with four stages:
Fetch
Decode
Execute
Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time  Execution
0     Four instructions are waiting to be executed.
1     The green instruction is fetched from memory.
2     The green instruction is decoded; the purple instruction is fetched from memory.
3     The green instruction is executed (actual operation is performed); the purple instruction is decoded; the blue instruction is fetched.
4     The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched.
5     The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded.
6     The purple instruction is completed; the blue instruction is written back; the red instruction is executed.
7     The blue instruction is completed; the red instruction is written back.
8     The red instruction is completed.
9     All instructions are executed.
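The schedule above can be reproduced with a short sketch. It assumes an ideal pipeline in which one instruction enters the Fetch stage per clock cycle, starting at cycle 1; the `timeline` function is a hypothetical helper written for this example.

```python
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]


def timeline(instructions, stages=STAGES):
    """Map each clock cycle to the (instruction, stage) pairs active in it."""
    table = {}
    for index, name in enumerate(instructions):
        for depth, stage in enumerate(stages):
            cycle = 1 + index + depth  # instruction `index` is fetched in cycle index + 1
            table.setdefault(cycle, []).append((name, stage))
    return table


if __name__ == "__main__":
    schedule = timeline(["green", "purple", "blue", "red"])
    for cycle in sorted(schedule):
        busy = ", ".join(f"{name}: {stage}" for name, stage in schedule[cycle])
        print(f"cycle {cycle}: {busy}")
```

In cycle 4 all four stages are busy at once, and the last write-back (of the red instruction) happens in cycle 7.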
Bubble
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well but everything "ahead" of the purple instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7.
A bubble is like a stall: nothing useful happens in the fetch, decode, execute and write-back stages as it passes through them. It can be implemented by inserting a NOP (no-operation) instruction.
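The cost of the bubble can be checked with a small calculation. The sketch below assumes the four-stage pipeline of the generic example, with each instruction fetched one cycle after the previous one unless it is delayed; `completion_cycles` and its `fetch_delays` argument are hypothetical names used only here.

```python
def completion_cycles(fetch_delays, n_stages=4):
    """Return the cycle in which each instruction leaves the last stage.

    fetch_delays[i] is the number of extra cycles instruction i waits
    before it can be fetched (0 means no bubble in front of it).
    """
    finish = []
    next_fetch = 1  # earliest cycle in which the next instruction may be fetched
    for delay in fetch_delays:
        fetch = next_fetch + delay
        finish.append(fetch + n_stages - 1)
        next_fetch = fetch + 1  # later instructions queue up behind this one
    return finish


if __name__ == "__main__":
    print(completion_cycles([0, 0, 0, 0]))  # green, purple, blue, red without a bubble
    print(completion_cycles([0, 1, 0, 0]))  # purple's fetch delayed by one cycle
```

Without the bubble the four instructions finish in cycles 4, 5, 6 and 7. With the purple fetch delayed by one cycle they finish in cycles 4, 6, 7 and 8: the green instruction ahead of the bubble is unaffected, and everything behind it slips by one cycle.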
Example 1
A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction
The locations 'R1' and 'R2' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into these registers, then added, and the result is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long: load, execute, and store. Each of these steps is called a pipeline stage.
On a non-pipelined processor, only one stage can be working at a time so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the execute stage, a second instruction will be at the decode stage and a 3rd instruction will be at the fetch stage.
Pipelining doesn't reduce the time it takes to complete an instruction; rather, it increases the number of instructions that can be processed at once and reduces the delay between completed instructions, which raises the 'throughput'. The more pipeline stages a processor has, the more instructions it can be working on at once and the less of a delay there is between completed instructions. Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel AVR and the PIC microcontroller each have a 2 stage pipeline.) Intel Pentium 4 processors have 20 stage pipelines.
Example 2
To better visualize the concept, we can look at a theoretical 3-stage pipeline (Load, Execute, Store) and an assembly listing to be executed:
LOAD #40, A      ; load 40 in A
MOVE A, B        ; copy A in B
ADD #20, B       ; add 20 to B
STORE B, 0x300   ; store B into memory cell 0x300
This is how it would be executed:
Clock 1
Load    Execute    Store
LOAD    -          -
The LOAD instruction is fetched from memory.
Clock 2
Load    Execute    Store
MOVE    LOAD       -
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.
Clock 3
Load    Execute    Store
ADD     MOVE       LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A.
In the meantime, the MOVE instruction is being executed.
Since it must move the contents of A into B, it must wait for the ending of the LOAD instruction.
Clock 4
Load    Execute    Store
STORE   ADD        MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.
And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.
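To make the MOVE dependency concrete, here is a deliberately simplified model of the 3-stage example. It is a sketch under stated assumptions, not a description of any real processor: registers start at 0, an instruction reads its operands in the Execute stage, the previous instruction writes its result in the Store stage during the same clock, and reads are assumed to happen before that write. The `run` function and its `forwarding` flag are hypothetical.

```python
PROGRAM = [
    ("LOAD", 40, "A"),      # LOAD #40, A
    ("MOVE", "A", "B"),     # MOVE A, B
    ("ADD", 20, "B"),       # ADD #20, B
    ("STORE", "B", 0x300),  # STORE B, 0x300
]


def run(program, forwarding):
    """Execute the program on the simplified pipeline model and return memory."""
    regs = {"A": 0, "B": 0}
    memory = {}
    in_store = None  # result of the previous instruction, currently in the Store stage

    def read(reg):
        # With forwarding, a value still waiting in the Store stage is used directly.
        if forwarding and in_store and in_store[0] == "reg" and in_store[1] == reg:
            return in_store[2]
        return regs[reg]

    def commit(result):
        kind, target, value = result
        if kind == "reg":
            regs[target] = value
        else:
            memory[target] = value

    for op, src, dst in program:
        # Execute stage of the current instruction (operands are read here).
        if op == "LOAD":
            result = ("reg", dst, src)
        elif op == "MOVE":
            result = ("reg", dst, read(src))
        elif op == "ADD":
            result = ("reg", dst, read(dst) + src)
        elif op == "STORE":
            result = ("mem", dst, read(src))
        else:
            raise ValueError(f"unknown operation: {op}")
        # Store stage of the previous instruction, in the same clock but after the read.
        if in_store is not None:
            commit(in_store)
        in_store = result
    if in_store is not None:
        commit(in_store)  # drain the last instruction
    return memory


if __name__ == "__main__":
    print(run(PROGRAM, forwarding=False))
    print(run(PROGRAM, forwarding=True))
```

Under these assumptions the run without forwarding stores 0 at address 0x300, because every instruction reads a stale register, while the run with forwarding stores the intended value 60. Stalling the dependent instruction until the earlier result has been written would give the same correct answer at the cost of extra cycles.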
Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (like in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long. The downside of a long pipeline is that when a program branches, the processor cannot know where to fetch the next instruction from and must wait until the branch instruction finishes, leaving the pipeline behind it empty. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even be slightly worse if all but one of the pipeline stages are idle and a small overhead is present between stages. Branch prediction attempts to alleviate this problem by guessing whether the branch will be taken or not and speculatively executing the code path that it predicts will be taken. When its predictions are correct, branch prediction avoids the penalty associated with branching. However, branch prediction itself can end up exacerbating the problem if branches are predicted poorly, as the incorrect code path which has begun execution must be flushed from the pipeline before resuming execution at the correct location.
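A back-of-the-envelope model shows why mispredicted branches hurt more as pipelines get longer. The sketch below assumes an otherwise ideal pipeline that completes one instruction per cycle and pays a fixed number of flush cycles for each mispredicted branch; the function name, its parameters and the example figures are illustrative assumptions, not measured data.

```python
def effective_cpi(branch_fraction, mispredict_rate, flush_penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction once misprediction flushes are included.

    branch_fraction      -- fraction of instructions that are branches
    mispredict_rate      -- fraction of those branches that are predicted wrongly
    flush_penalty_cycles -- cycles lost per misprediction (assumed here to be
                            roughly the pipeline depth)
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, 10% of them mispredicted.
    for depth in (5, 20, 31):
        print(depth, "stages:", round(effective_cpi(0.2, 0.1, depth), 2))
```

With a perfect predictor the penalty term vanishes; with the same assumed workload, the cycles lost per instruction grow in proportion to the pipeline depth.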
In some applications, such as supercomputing, programs are specially written to branch rarely and so very long pipelines are ideal to speed up the computations, as long pipelines are designed to reduce clocks per instruction (CPI). If branching happens constantly, re-ordering branches such that the more likely to be needed instructions are placed into the pipeline can significantly reduce the speed losses associated with having to flush failed branches. Programs such as gcov can be used to examine how often particular branches are actually executed using a technique known as coverage analysis; however, such analysis is often a last resort for optimization.
Because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates in the code very near the current location of execution may not take effect because they are already loaded into the Prefetch Input Queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.