Very long instruction word
Very long instruction word (VLIW) refers to a CPU architecture designed to take advantage of instruction-level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. Performance can be improved by executing different sub-steps of sequential instructions simultaneously (pipelining), or even by executing multiple instructions entirely simultaneously, as in superscalar architectures. Further improvement can be achieved by executing instructions in an order different from the order in which they appear in the program; this is called out-of-order execution.

As often implemented, these three techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies. For example, if a first instruction's result is used as a second instruction's input, the two cannot execute at the same time, and the second cannot execute before the first. Modern out-of-order processors devote increasing hardware resources to scheduling instructions and determining interdependencies.
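
The interdependency check described above can be sketched in a few lines of Python. This is only an illustration: the `Instr` record (one destination register and a set of source registers) and the register names are hypothetical and do not correspond to any real instruction set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str          # register written by the instruction
    srcs: frozenset    # registers read by the instruction

def independent(first: Instr, second: Instr) -> bool:
    """Return True if `second` may execute alongside (or ahead of) `first`."""
    raw = first.dest in second.srcs    # second reads what first writes
    war = second.dest in first.srcs    # second overwrites what first reads
    waw = first.dest == second.dest    # both write the same register
    return not (raw or war or waw)

a = Instr("r1", frozenset({"r2", "r3"}))   # r1 = r2 + r3
b = Instr("r4", frozenset({"r1", "r5"}))   # r4 = r1 * r5, needs the result of a
c = Instr("r6", frozenset({"r7", "r8"}))   # r6 = r7 - r8, unrelated to a

print(independent(a, b))  # False
print(independent(a, c))  # True
```

A superscalar or out-of-order processor performs a check of this kind in hardware for every group of candidate instructions, which is where the added complexity comes from.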

The VLIW approach, on the other hand, executes operations in parallel based on a fixed schedule determined when programs are compiled. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

As is the case with any novel architectural approach, the concept is only as useful as code generation makes it. An architecture designed for use in signal processing may have a number of special-purpose instructions to facilitate certain complicated operations such as fast Fourier transform (FFT) computation or certain calculations that recur in tomographic contexts. However, these optimized capabilities are useless unless compilers are able to spot relevant source code constructs and generate target code that duly utilizes the CPU's advanced offerings. Therefore, programmers must be able to express their algorithms in a manner that makes the compiler's task easier.

Modern graphics processing units (GPUs) have specialized matrix computation capabilities used in image processing. The idea behind the OpenCL language is to enable a programmer (or perhaps even an advanced compiler) to optimize certain non-graphics operations, such as an FFT algorithm, to make use of such advanced capabilities.

Design

In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes only one operation. For most superscalar designs, the instruction width is 32 bits or fewer. VLIW is a type of MIMD.

In contrast, one VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation for each execution unit of the device. For example, if a VLIW device has five execution units, then a VLIW instruction for that device would have five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and on some architectures are much wider.
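
As a rough sketch of the five-unit example above, the following Python packs one operation field per execution unit into a single wide word and unpacks it again. The 16-bit field width, the use of 0 as a no-operation code, and the function names are assumptions chosen for illustration; real VLIW encodings differ from machine to machine.

```python
FIELD_BITS = 16   # assumed width of one operation field
NUM_UNITS = 5     # five execution units, as in the example in the text
NOP = 0           # assumed encoding of "no operation"

def pack(ops):
    """Pack one operation code per execution unit into a single integer word."""
    if len(ops) != NUM_UNITS:
        raise ValueError("need exactly one field per execution unit")
    word = 0
    for unit, op in enumerate(ops):
        if not 0 <= op < (1 << FIELD_BITS):
            raise ValueError("operation does not fit in its field")
        word |= op << (unit * FIELD_BITS)   # field `unit` occupies its own bit range
    return word

def unpack(word):
    """Split a wide instruction word back into its per-unit operation fields."""
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (unit * FIELD_BITS)) & mask for unit in range(NUM_UNITS)]

bundle = pack([0x1234, NOP, 0x00FF, NOP, 0xBEEF])
print(bundle.bit_length() <= FIELD_BITS * NUM_UNITS)   # True: fits in 80 bits
print(unpack(bundle) == [0x1234, NOP, 0x00FF, NOP, 0xBEEF])  # True
```

Each execution unit only ever looks at its own field, so no hardware is needed to decide which unit an operation should go to.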

For example, the following instruction for the SHARC does, in one cycle, a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits into a single 48-bit instruction:
f12=f0*f4, f8=f8+f12, f0=dm(i0,m3), f4=pm(i8,m9);
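
To make the one-cycle behaviour of the instruction above concrete, here is a small Python model of it. It assumes that all four operations read their operands from the register state at the start of the cycle and write their results at the end, and that `dm(i0,m3)` and `pm(i8,m9)` load from the address in the index register and then add the modifier to it. These semantics, the register dictionary, and the helper names `step` and `dot` are assumptions made for this sketch, not a description taken from the SHARC documentation.

```python
def step(regs, dm, pm):
    """Apply the four operations of the instruction as one parallel assignment."""
    old = dict(regs)                       # every operand comes from the pre-cycle state
    regs["f12"] = old["f0"] * old["f4"]    # f12 = f0 * f4
    regs["f8"] = old["f8"] + old["f12"]    # f8 = f8 + f12 (uses the previous f12)
    regs["f0"] = dm[old["i0"]]             # f0 = dm(i0, m3): load ...
    regs["i0"] = old["i0"] + old["m3"]     # ... then advance i0 by m3
    regs["f4"] = pm[old["i8"]]             # f4 = pm(i8, m9): load ...
    regs["i8"] = old["i8"] + old["m9"]     # ... then advance i8 by m9

def dot(x, y):
    """Dot product computed by repeating the single instruction."""
    n = len(x)
    dm = list(x) + [0.0, 0.0]              # two padding loads drain the pipeline
    pm = list(y) + [0.0, 0.0]
    regs = {"f0": 0.0, "f4": 0.0, "f8": 0.0, "f12": 0.0,
            "i0": 0, "i8": 0, "m3": 1, "m9": 1}
    for _ in range(n + 2):                 # load, multiply, add: three overlapped stages
        step(regs, dm, pm)
    return regs["f8"]

print(dot([1, 2, 3], [4, 5, 6]))  # 32.0
```

Under these assumptions, repeating the same instruction forms a software-pipelined multiply-accumulate loop: in every cycle one pair of elements is loaded, the previous pair is multiplied, and the product before that is accumulated.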


Since the earliest days of computer architecture, some CPUs have included several additional arithmetic logic units (ALUs) to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel. VLIW CPUs use software (the compiler) to decide which operations can run in parallel. Because the complexity of instruction scheduling is pushed off onto the compiler, the hardware's complexity can be substantially reduced.

A similar problem occurs when the result of a parallelisable instruction is used as input for a branch. Most modern CPUs "guess" which branch will be taken even before the calculation is complete, so that they can load up the instructions for the branch, or (in some architectures) even start to compute them speculatively. If the CPU guesses wrong, all of these instructions and their context need to be "flushed" and the correct ones loaded, which is time-consuming.

This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original RISC designs has been eroded. VLIW lacks this logic, and therefore lacks its power consumption, possible design defects and other negative features.

In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch goes the unexpected way, the compiler has already generated compensatory code to discard speculative results to preserve program semantics.

The acronym VLIW may also refer to Variable Length Instruction Word, a CPU instruction set designed to load (or copy) a literal value count of inline machine code to the on-chip RAM for higher speed CPU decoding.

History

The term VLIW, and the concept of VLIW architecture itself, were invented by Josh Fisher in his research group at Yale University in the early 1980s. His original development of trace scheduling as a compilation technique for VLIW took place when he was a graduate student at New York University. Prior to VLIW, the notion of prescheduling functional units and instruction-level parallelism in software was well established in the practice of developing horizontal microcode. Fisher's innovations centered on developing a compiler that could target horizontal microcode from programs written in an ordinary programming language. He realized that to get good performance and target a wide-issue machine, it would be necessary to find parallelism beyond that generally available within a basic block. He developed region scheduling techniques to identify parallelism beyond basic blocks. Trace scheduling is such a technique, and involves scheduling the most likely path of basic blocks first, inserting compensation code to deal with speculative motions, scheduling the second most likely trace, and so on, until the schedule is complete.
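
The trace-selection step of the procedure just described can be sketched as follows. This is a simplified illustration, not Fisher's algorithm as published: it only grows traces forward from the most frequently executed unplaced block, it omits the scheduling and compensation-code steps entirely, and the control-flow graph representation (`succ`, `counts`) and the function name are hypothetical.

```python
def pick_traces(succ, counts):
    """Greedily partition basic blocks into traces, most likely path first.

    succ:   block -> list of (successor, branch probability)
    counts: block -> profiled execution count
    """
    unplaced = set(counts)
    traces = []
    while unplaced:
        # Seed the next trace with the hottest block not yet in any trace.
        block = max(sorted(unplaced), key=lambda b: counts[b])
        trace = []
        while block is not None and block in unplaced:
            trace.append(block)
            unplaced.remove(block)
            # Follow the most probable edge to a block that is still unplaced.
            candidates = [(p, s) for s, p in succ.get(block, []) if s in unplaced]
            block = max(candidates)[1] if candidates else None
        traces.append(trace)
    return traces

# Diamond-shaped control flow: A branches to B (90%) or C (10%), both rejoin at D.
succ = {"A": [("B", 0.9), ("C", 0.1)], "B": [("D", 1.0)], "C": [("D", 1.0)]}
counts = {"A": 100, "B": 90, "C": 10, "D": 100}
print(pick_traces(succ, counts))  # [['A', 'B', 'D'], ['C']]
```

In this example the likely path A, B, D would be scheduled first as one long region, and the rarely executed block C would be scheduled afterwards along with whatever compensation code the earlier code motions require.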

Fisher's second innovation was the notion that the target CPU architecture should be designed to be a reasonable target for a compiler; that is, the compiler and the architecture for VLIW must be co-designed. This was partly inspired by the difficulty Fisher observed at Yale of compiling for architectures like Floating Point Systems' FPS164, which had a complex instruction set architecture (CISC) that separated instruction initiation from the instructions that saved the result, requiring very complicated scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port register files, and memory architectures. These principles made it easier for compilers to generate fast code.

The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. The compiler was christened Bulldog, after Yale's mascot. John Ruttenberg also developed certain important algorithms for scheduling.

Fisher left Yale in 1984 to found a startup company, Multiflow, along with co-founders John O'Donnell and John Ruttenberg. Multiflow produced the TRACE series of VLIW minisupercomputers, shipping their first machines in 1987. Multiflow's VLIW could issue 28 operations in parallel per instruction. The TRACE system was implemented in an MSI/LSI/VLSI mix packaged in cabinets, a technology that fell out of favor when it became more cost-effective to integrate all of the components of a processor (excluding memory) on a single chip. Multiflow was too early to catch the following wave, when chip architectures began to allow multiple-issue CPUs. The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these companies.

Implementations

Cydrome was a company producing VLIW numeric processors using ECL technology in the same timeframe (late 1980s). This company, like Multiflow, went out of business after a few years.

One of the licensees of the Multiflow technology is Hewlett-Packard, which Josh Fisher joined after Multiflow's demise. Bob Rau, founder of Cydrome, also joined HP after Cydrome failed. These two would lead computer architecture research within Hewlett-Packard during the 1990s.

In addition to the above systems, at around the same period (i.e. 1989-1990), Intel implemented VLIW in the Intel i860, their first 64-bit microprocessor; the i860 was also the first processor to implement VLIW on a single chip. This processor could operate in both simple RISC mode and VLIW mode:

In the early 1990s, Intel introduced the i860 RISC microprocessor. This simple chip had two modes of operation: a scalar mode and a VLIW mode. In the VLIW mode, the processor always fetched two instructions and assumed that one was an integer instruction and the other floating-point.


The i860's VLIW mode was used extensively in embedded DSP applications, since the application execution and datasets were simple, well ordered and predictable, allowing designers to take full advantage of the parallel execution that VLIW offered. In VLIW mode the i860 was able to maintain floating-point performance in the range of 20-40 double-precision MFLOPS, an extremely high figure for its time and for a processor operating at 25-50 MHz.

In the 1990s, Hewlett-Packard researched the problem of instruction-dispatch complexity as a side effect of ongoing work on their PA-RISC processor family. They found that the CPU could be greatly simplified by removing the complex dispatch logic from the CPU and placing it into the compiler. Today's compilers are much more complex than those from the 1980s, so the added complexity in the compiler was considered to be a small cost.

VLIW CPUs are usually constructed of multiple RISC-like functional units that operate independently. Contemporary VLIWs typically have four to eight main functional units. Compilers generate initial instruction sequences for the VLIW CPU in roughly the same manner that they do for traditional CPUs, generating a sequence of RISC-like instructions. The compiler analyzes this code for dependence relationships and resource requirements. It then schedules the instructions according to those constraints. In this process, independent instructions can be scheduled in parallel. Because VLIWs typically represent instructions scheduled in parallel with a longer instruction word that incorporates the individual instructions, this results in a much longer opcode (thus the term "very long") to specify what executes on a given cycle.
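
The scheduling process described above can be illustrated with a very small list scheduler. The sketch assumes that every operation takes one cycle and that all functional units are identical, which real VLIW compilers do not assume; the function name and the dictionary-based dependence representation are hypothetical.

```python
def schedule(ops, deps, width):
    """Pack operations into long instruction words of `width` slots each.

    ops:   operation names in original program order
    deps:  operation -> set of operations whose results it needs
    Returns a list of bundles; unused slots are filled with "nop".
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        # An operation is ready once everything it depends on has completed
        # in an earlier bundle.
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependence between operations")
        issued = ready[:width]                       # resource constraint
        bundles.append(issued + ["nop"] * (width - len(issued)))
        done.update(issued)
        remaining = [op for op in remaining if op not in issued]
    return bundles

# e = a + b; f = c + d; g = e * f  (g needs both e and f)
print(schedule(["e", "f", "g"], {"g": {"e", "f"}}, width=2))
# [['e', 'f'], ['g', 'nop']]
```

Each inner list corresponds to one long instruction word: the two independent additions share a word, while the multiplication must wait for the next one.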

Examples of contemporary VLIW CPUs include the TriMedia media processors by NXP (formerly Philips Semiconductors), the SHARC DSP by Analog Devices, the C6000 DSP family by Texas Instruments, and the STMicroelectronics ST200 family based on the Lx architecture (also designed by Josh Fisher). These contemporary VLIW CPUs are primarily successful as embedded media processors for consumer electronic devices.

VLIW features have also been added to configurable processor cores for SoC designs. For example, Tensilica's Xtensa LX2 processor incorporates a technology dubbed FLIX (Flexible Length Instruction eXtensions) that allows multi-operation instructions. The Xtensa C/C++ compiler can freely intermix 32- or 64-bit FLIX instructions with the Xtensa processor's single-operation RISC instructions, which are 16 or 24 bits wide. By packing multiple operations into a wide 32- or 64-bit instruction word and allowing these multi-operation instructions to be intermixed with shorter RISC instructions, FLIX technology allows SoC designers to realize VLIW's performance advantages while eliminating the code bloat of early VLIW architectures.

The Infineon Carmel DSP is another VLIW processor core intended for SoC; it uses a similar code density improvement technique called "configurable long instruction word" (CLIW).

Outside embedded processing markets, Intel's Itanium IA-64 EPIC appears as the only example of a widely used VLIW CPU architecture. However, EPIC architecture is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups. VLIWs have, however, gained significant consumer penetration in the GPU market. In particular, ATI/AMD's family of GPU architectures (including the R600, R700, R800, and R900) are VLIWs.

Backward compatibility

When silicon technology allowed for wider implementations (with more execution units) to be built, the compiled programs for the earlier generation would not run on the wider implementations, as the encoding of the binary instructions depended on the number of execution units of the machine.

Transmeta addresses this issue by including a binary-to-binary software compiler layer (termed Code Morphing) in their Crusoe implementation of the x86 architecture. Basically, this mechanism is advertised to recompile, optimize, and translate x86 opcodes at runtime into the CPU's internal machine code. Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled from the x86 CISC instruction set that it executes.

Intel's Itanium architecture (among others) solved the backward-compatibility problem with a more general mechanism. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the previous VLIW instruction within the program instruction stream. These bits are set at compile time, thus relieving the hardware from calculating this dependency information. Having this dependency information encoded into the instruction stream allows wider implementations to issue multiple non-dependent VLIW instructions in parallel per cycle, while narrower implementations would issue a smaller number of VLIW instructions per cycle.
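
The effect of such compile-time dependency bits can be illustrated with a toy model. Here each VLIW instruction carries a single flag meaning "depends on the previous instruction", as in the description above; the actual Itanium encoding is more detailed, and the function name and the in-order issue model are assumptions made for this sketch.

```python
def cycles_needed(depends_on_prev, width):
    """Count issue cycles for an in-order machine that issues up to `width`
    VLIW instructions per cycle and starts a new cycle at every dependency."""
    cycles = 0
    in_cycle = 0                      # instructions already issued this cycle
    for dep in depends_on_prev:
        if in_cycle == 0 or dep or in_cycle == width:
            cycles += 1               # start a new issue cycle
            in_cycle = 1
        else:
            in_cycle += 1             # independent: join the current cycle
    return cycles

# The same compiled instruction stream, run on three implementations.
stream = [False, False, True, False, False, False]
print(cycles_needed(stream, width=1))  # 6
print(cycles_needed(stream, width=2))  # 3
print(cycles_needed(stream, width=4))  # 2
```

The same binary runs correctly on all three hypothetical implementations; only the number of cycles changes, which is the compatibility property the dependency bits are meant to provide.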

Another perceived deficiency of VLIW architectures is the code bloat that occurs when not all of the execution units have useful work to do and thus have to execute NOPs. This occurs when there are dependencies in the code and the functional pipelines must be allowed to drain before subsequent operations can proceed.
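
A quick way to see the scale of this effect is to count the share of slots occupied by NOPs in a schedule. The bundle representation below (a list of lists of operation names, with the string "nop" marking an idle slot) is a hypothetical one chosen for this sketch.

```python
def nop_fraction(bundles):
    """Fraction of operation slots in a schedule that hold a NOP."""
    slots = sum(len(bundle) for bundle in bundles)
    nops = sum(op == "nop" for bundle in bundles for op in bundle)
    return nops / slots if slots else 0.0

# Three useful operations scheduled on a four-wide machine:
# two independent additions, then a multiply that depends on both.
schedule = [["e", "f", "nop", "nop"],
            ["g", "nop", "nop", "nop"]]
print(nop_fraction(schedule))  # 0.625
```

In this small example five of the eight encoded slots do no useful work, yet in a fixed-width encoding they still occupy instruction memory.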

Since the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. The VLIW architecture is growing in popularity, particularly in the embedded market, where it is possible to customize a processor for an application in an embedded system-on-a-chip. Embedded VLIW products are available from several vendors, including the FR-V from Fujitsu, the BSP15/16 from Pixelworks, the ST231 from STMicroelectronics, the TriMedia from NXP, the CEVA-X DSP from CEVA, the Jazz DSP from Improv Systems, and Silicon Hive. The Texas Instruments TMS320 DSP line has evolved, in its C6xxx family, to look more like a VLIW, in contrast to the earlier C5xxx family.

See also

  • Explicitly parallel instruction computing (EPIC)
  • Transport triggered architecture (TTA)
  • Elbrus processors

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 