All Topics  
Very long instruction word

 

   Email Print
   Bookmark   Link






 

Very long instruction word



 
 
Very Long Instruction Word or VLIW refers to a CPU
Central processing unit

A central processing unit is an electronic circuit that can execute computer programs. This broad definition can easily be applied to many early computers that existed long before the term "CPU" ever came into widespread usage....
 architecture designed to take advantage of instruction level parallelism
Instruction level parallelism

Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program:...
 (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously as in superscalar
Superscalar

A superscalar Central processing unit architecture implements a form of parallel computer called instruction level parallelism within a single processor....
 architectures.






Discussion
Ask a question about 'Very long instruction word'
Start a new discussion about 'Very long instruction word'
Answer questions from other users
Full Discussion Forum



Encyclopedia


Very Long Instruction Word or VLIW refers to a CPU
Central processing unit

A central processing unit is an electronic circuit that can execute computer programs. This broad definition can easily be applied to many early computers that existed long before the term "CPU" ever came into widespread usage....
 architecture designed to take advantage of instruction level parallelism
Instruction level parallelism

Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program:...
 (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously as in superscalar
Superscalar

A superscalar Central processing unit architecture implements a form of parallel computer called instruction level parallelism within a single processor....
 architectures. Further improvement can be achieved by executing instructions in an order different from the order they appear in the program; this is called out-of-order execution
Out-of-order execution

In computer engineering, out-of-order execution, OoOE, is a paradigm used in most high-performance microprocessors to make use of Instruction cycle that would otherwise be wasted by a certain type of costly delay....
.

These three techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies
Dependence analysis

In compiler theory, dependence analysis produces execution-order constraints between statements/instructions. Broadly speaking, a statement S2 depends on S1 if S1 must be executed before S2....
. There are many types of interdependencies, but a simple example would be a program in which the first instruction's result is used as an input for the second instruction. They clearly cannot execute at the same time, and the second instruction can't be executed before the first. Modern out-of-order processors use significant resources in order to take advantage of these techniques, since the scheduling of instructions must be determined dynamically as a program executes based on dependencies.

The VLIW approach, on the other hand, executes operation in parallel based on a fixed schedule determined when programs are compiled
Compiler

A compiler is a computer program that transforms source code written in a programming language into another computer language . The most common reason for wanting to transform source code is to create an executable program....
. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

As is the case with any novel architectural approach, the concept is only as useful as code generation makes it. That is, the fact that a number of special-purpose instructions are available to facilitate certain complicated operations—say, fast Fourier transform
Fast Fourier transform

A fast Fourier transform is an efficient algorithm to compute the discrete Fourier transform and its inverse. There are many distinct FFT algorithms involving a wide range of mathematics, from simple complex number to group theory and number theory; this article gives an overview of the available techniques and some of their general propert...
 (FFT) computation or certain calculations that recur in tomographic contexts—is useless if compilers are unable to spot relevant source code constructs and generate target code that duly utilizes the CPU's advanced offerings. A fortiori, the programmer must be able to express his algorithms in a manner that makes the compiler's task easier.

History has demonstrated that instruction sets must strike a careful balance between complexity and ease of exploitation. For example, the ability of the DEC
Digital Equipment Corporation

Digital Equipment Corporation was a pioneering United States company in the computer industry. It is often referred to within the computing industry as DEC ....
 VAX
VAX

VAX was an instruction set architecture developed by Digital Equipment Corporation in the mid-1970s. A 32-bit complex instruction set computer ISA, it was designed to extend or replace DEC's various Programmed Data Processor ISAs....
 to evaluate polynomials in a single instruction is useful insofar as the construct recurs frequently in certain applications; it is easy enough for the programmer to exploit and for the compiler to detect; and it is general enough to not demand unduly complicated microcode or, yet, special-purpose, hard-wired logic.

Design

In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes only one operation. For most superscalar designs, the instruction width is 32 bits or fewer.

In contrast, one VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation for each execution unit of the device. For example, if a VLIW device has five execution units, then a VLIW instruction for that device would have five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits in width, and on some architectures are much wider.

For example, the following is an instruction for the SHARC
Super Harvard Architecture Single-Chip Computer

The Super Harvard Architecture Single-Chip Computer is a high performance floating-point and fixed-point digital signal processor from Analog Devices,...
. In one cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits into a single 48-bit instruction.

f12=f0*f4, f8=f8+f12, f0=dm(i0,m3), f4=pm(i8,m9);

Since the earliest days of computer architecture, some CPUs have added several additional arithmetic logic unit
Arithmetic logic unit

In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logicaloperations. The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers....
s (ALUs) to run in parallel. Superscalar
Superscalar

A superscalar Central processing unit architecture implements a form of parallel computer called instruction level parallelism within a single processor....
 CPUs use hardware to decide which operations can run in parallel. VLIW CPUs use software (the compiler) to decide which operations can run in parallel. Because the complexity of instruction scheduling is pushed off onto the compiler, the hardware's complexity can be substantially reduced.

A similar problem occurs when the result of a parallelisable instruction is used as input for a branch. Most modern CPUs "guess" which branch will be taken even before the calculation is complete, so that they can load up the instructions for the branch, or (in some architectures) even start to compute them speculatively
Speculative execution

In computer science, speculative execution is the execution of Code , the result of which may not be needed. In the context of functional programming, the term "speculative evaluation" is used instead....
. If the CPU guesses wrong, all of these instructions and their context need to be "flushed" and the correct ones loaded, which is time-consuming.

This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original RISC designs has been eroded. VLIW lacks this logic, and therefore lacks its power consumption, possible design defects and other negative features.

In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch goes the unexpected way, the compiler has already generated compensatory code to discard speculative results in order to preserve program semantics.

History

The term VLIW, and the concept of VLIW architecture itself, were invented by Josh Fisher
Josh Fisher

Joseph A. "Josh" Fisher is an United States computer science. He is a Hewlett-Packard Senior Fellow. He worked at HP Labs from 1990 through 2006 in instruction-level parallelism and in custom embedded VLIW processors and their compilers....
 in his research group at Yale University
Yale University

Yale University is a private university in New Haven, Connecticut. Founded in 1701 as the Collegiate School, Yale is the Colonial Colleges institution of higher education in the United States and is a member of the Ivy League....
 in the early 1980s. His original development of trace scheduling
Trace scheduling

Trace scheduling is an Compiler optimization technique used in compilers for computer programs.A compiler often can, by rearranging its generated machine instructions for faster execution, improve program performance....
 as a compilation technique for VLIW was developed when he was a graduate student at New York University
New York University

New York University is a private university, nonsectarian, research university in New York City. NYU's main campus is situated in the Greenwich Village section of Manhattan....
. Prior to VLIW, the notion of prescheduling functional units and instruction-level parallelism in software was well established in the practice of developing horizontal microcode
Microcode

Microcode is a layer of lowest-level instructions involved in the implementation of machine code instructions in many computers and other processors; it resides in a special high-speed memory and translates machine instructions into sequences of detailed circuit-level operations....
. Fisher's innovations were around developing a compiler that could target horizontal microcode from programs written in an ordinary programming language. He realized that in order to get good performance, and to target a wide-issue machine, it would be necessary to find parallelism beyond that which one generally finds within a basic block
Basic block

In computing, a basic block is code that has one entry point , one exit point and no jump instructions contained within it. The start of a basic block may be jumped to from more than one location....
. He developed region scheduling techniques to identify parallelism beyond basic blocks. Trace scheduling is such a technique, and involves scheduling the most likely path of basic blocks first, inserting compensation code to deal with speculative motions, scheduling the second most likely trace, and so on, until the schedule is complete.

Fisher's second innovation was the notion that the target CPU architecture should be designed to be a reasonable target for a compiler — the compiler and the architecture for VLIW must be co-designed. This was partly inspired by the difficulty Fisher observed at Yale of compiling for architectures like Floating Point Systems
Floating Point Systems

Floating Point Systems Inc. was a Beaverton, Oregon vendor of minisupercomputers. The company was founded in 1970 by former Tektronix engineer Norm Winningstad....
' FPS164, which had a complex instruction set architecture (CISC
Complex instruction set computer

A complex instruction set computer is a computer instruction set architecture in which each instruction can execute several low-level operations, such as a load from Memory , an arithmetic operator, and a memory , all in a single instruction....
) that separated instruction initiation from the instructions that saved the result, requiring very complicated scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port register file
Register file

A register file is an array of processor registers in a central processing unit. Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports....
s, and memory architectures. These principles made it easier for compilers to write fast code.

The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. The compiler was christened Bulldog, after Yale's mascot. John Ruttenberg also developed certain important algorithms for scheduling.

Fisher left Yale in 1984 to found a startup company, Multiflow
Multiflow

Multiflow Computer, Inc. , founded in April, 1984 near New Haven, Connecticut, was a manufacturer and seller of minisupercomputer hardware and software embodying the VLIW design style....
, along with co-founders John O'Donnell and John Ruttenberg. Multiflow produced the TRACE series of VLIW minisupercomputer
Minisupercomputer

Minisupercomputers constituted a class of computers that emerged in the mid-1980s. As scientific computing using vector processors became more popular, the need for lower-cost systems that might be used at the departmental level instead of the corporate level created an opportunity for new computer vendors to enter the market....
s, shipping their first machines around 1988. Multiflow's VLIW could issue 28 operations in parallel per instruction. The TRACE system was implemented in an MSI/LSI/VLSI mix packaged in cabinets, a technology that fell out of favor when it became more cost-effective to integrate all of the components of a processor (excluding memory) on a single chip. Multiflow was too early to catch the following wave, when chip architectures began to allow multiple issue CPUs. The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these companies.

Implementations

Cydrome
Cydrome

Cydrome was a computer company started in 1984 in Milpitas, California, California whose mission was to develop a numeric processor for its primary customer, Prime Computer....
 was a company producing VLIW numeric processors using ECL technology in the same timeframe (late 1980s). This company, like Multiflow, went out of business after a few years.

One of the licensees of the Multiflow technology is Hewlett-Packard
Hewlett-Packard

The Hewlett-Packard Company , commonly referred to as HP, is a technology corporation headquartered in Palo Alto, California, United States....
, which Josh Fisher joined after Multiflow's demise. Bob Rau, founder of Cydrome, also joined HP after Cydrome failed. These two would lead computer architecture research within Hewlett-Packard during the 1990s.

In the 1990s, Hewlett-Packard researched this problem as a side effect of ongoing work on their PA-RISC processor family. They found that the CPU could be greatly simplified by removing the complex dispatch logic from the CPU and placing it into the compiler. Today's compilers are much more complex than those from the 1980s, so the added complexity in the compiler was considered to be a small cost.

VLIW CPUs are usually constructed of multiple RISC-like functional units that operate independently. Contemporary VLIWs typically have four to eight main functional units. Compilers generate initial instruction sequences for the VLIW CPU in roughly the same manner that they do for traditional CPUs, generating a sequence of RISC-like instructions. The compiler analyzes this code for dependence relationships and resource requirements. It then schedules the instructions according to those constraints. In this process, independent instructions can be scheduled in parallel. Because VLIWs typically represent instructions scheduled in parallel with a longer instruction word that incorporates the individual instructions, this results in a much longer opcode
Opcode

In computer technology, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Their specification and format are laid out in the instruction set architecture of the processor in question ....
 (thus the term "very long") to specify what executes on a given cycle.

Examples of contemporary VLIW CPUs include the TriMedia
Trimedia

*TriMedia , a VLIW Mediaprocessor from NXP Semiconductors*Trimedia International , a pan-European PR agency based in the UK*You may have been looking for Trymedia, a digital distribution company....
 media processors by NXP (formerly Philips Semiconductors), the SHARC
Super Harvard Architecture Single-Chip Computer

The Super Harvard Architecture Single-Chip Computer is a high performance floating-point and fixed-point digital signal processor from Analog Devices,...
 DSP by Analog Devices, the C6000 DSP
Digital signal processor

A digital signal processor is a specialized microprocessor designed specifically for digital signal processing, generally in real-time computing....
 family by Texas Instruments
Texas Instruments

Texas Instruments , better known in the electronics industry as TI, is an United States company based in Dallas, Texas, Texas, United States, renowned for developing and commercializing semiconductor and computer technology....
, and the STMicroelectronics ST200 family
ST200 family

The ST200 is a family of VLIW processor cores based on technology jointly developed byHewlett-Packard Laboratories and STMicroelectronics under the name Lx....
 based on the Lx architecture (also designed by Josh Fisher). These contemporary VLIW CPUs are primarily successful as embedded media processors for consumer electronic devices.

VLIW features have also been added to configurable processor cores for SoC
System-on-a-chip

System-on-a-chip or system on chip refers to integrating all components of a computer or other Electronics system into a single integrated circuit ....
 designs. For example, Tensilica's Xtensa
Xtensa

Xtensa is a 32-bit microprocessor core designed by Tensilica.Tensilica describes it as "a configurable, extensible and synthesizable processor core" ......
 LX2 processor incorporates a technology dubbed FLIX (Flexible Length Instruction eXtensions) that allows multi-operation instructions. The Xtensa C/C++ compiler can freely intermix 32- or 64-bit FLIX instructions with the Xtensa processor's single-operation RISC instructions, which are 16 or 24 bits wide. By packing multiple operations into a wide 32- or 64-bit instruction word and allowing these multi-operation instructions to be intermixed with shorter RISC instructions, FLIX technology allows SoC designers to realize VLIW's performance advantages while eliminating the code bloat
Code bloat

Code bloat is the production of code that is perceived as unnecessarily long, slow, or otherwise wasteful of resources.Code bloat can also be caused by inadequacies in the language in which the code is written, or inadequacies in the compiler used to compile the language....
 of early VLIW architectures.

Outside embedded processing markets, Intel's Itanium
Itanium

Itanium is the brand name for 64-bit Intel microprocessors that implement the Intel Itanium architecture . Intel has released two processor families using the brand: the original Itanium and the Itanium 2....
 IA-64 EPIC
Explicitly Parallel Instruction Computing

Explicitly Parallel Instruction Computing is a term coined in 1997 by the Itanium to describe a computing paradigm that began to be researched in the early 1980s....
 appears as the only example of a widely used VLIW architecture. However, EPIC architecture is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups.

Backward compatibility


When silicon technology allowed for wider implementations (with more execution units) to be built, the compiled programs for the earlier generation would not run on the wider implementations, as the encoding of the binary instructions depended on the number of execution units of the machine.

Transmeta
Transmeta

Transmeta Corporation was a United States-based corporation that licensed low power semiconductor intellectual property. Transmeta originally produced very long instruction word code morphing microprocessors, with a focus on reducing power consumption in electronic devices....
 addresses this issue by including a binary-to-binary software compiler layer (termed Code Morphing
Binary translation

In computing, binary translation is the emulation of one instruction set by another through translation of Machine language. Sequences of instruction s are translated from the source to the target instruction set....
) in their Crusoe implementation of the x86 architecture. Basically, this mechanism is advertised to recompile, optimize, and translate x86 opcodes at runtime into the CPU's internal machine code. Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled from the x86 CISC instruction set
Instruction set

An instruction set is a list of all the instruction , and all their variations, that a processor can execute.Instructions include:* Arithmetic such as add and subtract...
 that it executes.

Intel's Itanium
Itanium

Itanium is the brand name for 64-bit Intel microprocessors that implement the Intel Itanium architecture . Intel has released two processor families using the brand: the original Itanium and the Itanium 2....
 architecture (among others) solved the backward-compatibility problem with a more general mechanism. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the previous VLIW instruction within the program instruction stream. These bits are set at compile time
Compile time

In computer science, compile time refers to either the operations performed by a compiler , programming language requirements that must be met by source code for it to be successfully compiled , or properties of the program that can be reasoned about at compile time....
, thus relieving the hardware from calculating this dependency information. Having this dependency information encoded into the instruction stream allows wider implementations to issue multiple non-dependent VLIW instructions in parallel per cycle, while narrower implementations would issue a smaller number of VLIW instructions per cycle.

Another perceived deficiency of VLIW architectures is the code bloat
Code bloat

Code bloat is the production of code that is perceived as unnecessarily long, slow, or otherwise wasteful of resources.Code bloat can also be caused by inadequacies in the language in which the code is written, or inadequacies in the compiler used to compile the language....
 that occurs when not all of the execution units have useful work to do and thus have to execute NOP
NOP

In computer science NOP or NOOP is an assembly language instruction, sequence of programming language statements, or protocol command that effectively does nothing at all....
s. This occurs when there are dependencies in the code and the functional pipelines must be allowed to drain before subsequent operations can proceed.

Since the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. The VLIW architecture is growing in popularity, particularly in the embedded market, where it is possible to customize a processor for an application in an embedded system-on-a-chip
System-on-a-chip

System-on-a-chip or system on chip refers to integrating all components of a computer or other Electronics system into a single integrated circuit ....
. Embedded VLIW products are available from several vendors, including the from Fujitsu
Fujitsu

is a Japanese company specializing in semiconductors, air conditioners, computers , telecommunications, and Service , and is headquartered in Minato, Tokyo, Tokyo....
, the from Pixelworks, the ST231 from STMicroelectronics, the from NXP, the CEVA-X DSP
CEVA-X DSP

The CEVA-X family of DSP cores from CEVA is based on the company's latest pioneering DSP architecture. The CEVA-X architecture has a unique mix of VLIW and SIMD architectures....
 from CEVA, the Jazz DSP
Jazz DSP

The Jazz DSP, by , is a VLIW embedded digital siginal processor architecture with a 2-stage instruction pipeline, and single-cycle execution units....
 from Improv Systems, and . The Texas Instruments TMS320 DSP line has evolved, in its C6xxx family, to look more like a VLIW, in contrast to the earlier C5xxx family.

See also

  • Explicitly parallel instruction computing
    Explicitly Parallel Instruction Computing

    Explicitly Parallel Instruction Computing is a term coined in 1997 by the Itanium to describe a computing paradigm that began to be researched in the early 1980s....
     (EPIC)
  • Elbrus processors
    Elbrus (computer)

    Elbrus is a series of Soviet Union supercomputer systems developed by Lebedev Institute of Precision Mechanics and Computer Engineering since the 1970s....


External links