Stream processing
Encyclopedia
Stream processing is a computer programming
Computer programming
Computer programming is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a program that performs specific operations or exhibits a...

 paradigm, related to SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

 (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing
Parallel computing
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently . There are several different forms of parallel computing: bit-level,...

. Such applications can use multiple computational units, such as the FPU
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

s on a GPU
Graphics processing unit
A graphics processing unit or GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display...

 or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) are applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local on-chip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding
Scoreboarding
Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged...

, for example, to launch DMA
Direct memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....

s at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of die area not dedicated to computational units such as ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s.

During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL
SISAL
SISAL is a general-purpose single assignment functional programming language with strict semantics, implicit parallelism, and efficient array handling. SISAL outputs a dataflow graph in Intermediary Form 1...

 (Streams and Iteration in a Single Assignment Language).

Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

 design may be tuned for maximum efficiency or a trade-off for flexibility.

Stream processing is especially suitable for applications that exhibit three application characteristics:
  • Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
  • Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
  • Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.

Stream processing is a computer programming
Computer programming
Computer programming is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a program that performs specific operations or exhibits a...

 paradigm, related to SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

 (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing
Parallel computing
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently . There are several different forms of parallel computing: bit-level,...

. Such applications can use multiple computational units, such as the FPU
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

s on a GPU
Graphics processing unit
A graphics processing unit or GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display...

 or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) are applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local on-chip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding
Scoreboarding
Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged...

, for example, to launch DMA
Direct memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....

s at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of die area not dedicated to computational units such as ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s.

During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL
SISAL
SISAL is a general-purpose single assignment functional programming language with strict semantics, implicit parallelism, and efficient array handling. SISAL outputs a dataflow graph in Intermediary Form 1...

 (Streams and Iteration in a Single Assignment Language).

Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

 design may be tuned for maximum efficiency or a trade-off for flexibility.

Stream processing is especially suitable for applications that exhibit three application characteristics:
  • Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
  • Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
  • Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.

Stream processing is a computer programming
Computer programming
Computer programming is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a program that performs specific operations or exhibits a...

 paradigm, related to SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

 (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing
Parallel computing
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently . There are several different forms of parallel computing: bit-level,...

. Such applications can use multiple computational units, such as the FPU
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

s on a GPU
Graphics processing unit
A graphics processing unit or GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display...

 or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) are applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local on-chip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding
Scoreboarding
Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged...

, for example, to launch DMA
Direct memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....

s at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of die area not dedicated to computational units such as ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s.

During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL
SISAL
SISAL is a general-purpose single assignment functional programming language with strict semantics, implicit parallelism, and efficient array handling. SISAL outputs a dataflow graph in Intermediary Form 1...

 (Streams and Iteration in a Single Assignment Language).

Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

 design may be tuned for maximum efficiency or a trade-off for flexibility.

Stream processing is especially suitable for applications that exhibit three application characteristics:
  • Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
  • Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
  • Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.



Comparison to prior parallel paradigms

Basic computers started from a sequential execution paradigm. Traditional CPU
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

s are SISD
SISD
In computing, SISD is a term referring to a computer architecture in which a single processor, a uniprocessor, executes a single instruction stream, to operate on data stored in a single memory. This corresponds to the von Neumann architecture.SISD is one of the four main classifications as...

 based, which means they conceptually perform only one operation at a time.
As the computing needs of the world evolved, the amount of data to be managed increased very quickly. It was obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts have been spent on finding alternative ways to perform massive amounts of computations but the only solution was to exploit some level of parallel execution.
The result of those efforts was SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

, a programming paradigm which allowed applying one instruction to multiple instances of (different) data. Most of the time, SIMD was being used in a SWAR
SWAR
SWAR is an acronym for SIMD Within A Register.SIMD, in turn, stands for Single Instruction, Multiple Data.Many modern general-purpose computer processors have some provisions for SIMD, in the form of a group of registers and instructions to make use of them...

 environment. By using more complicated structures, one could also have MIMD
MIMD
In computing, MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data...

 parallelism.

Although those two paradigms were efficient, real-world implementations were plagued with limitations from memory alignment problems to synchronization issues and limited parallelism. Only few SIMD processors survived as stand-alone components; most were embedded in standard CPUs.

Consider a simple program adding up two arrays containing 100 4-component vectors (i.e. 400 numbers in total).

Conventional, sequential paradigm


for(int i = 0; i < 100 * 4; i++)
result[i] = source0[i] + source1[i];

This is the sequential paradigm that is most familiar. Variations do exist (such as inner loops, structures and such) but they ultimately boil down to that.

Parallel SIMD paradigm, packed registers (SWAR)


for(int el = 0; el < 100; el++) // for each vector
vector_sum(result[el], source0[el], source1[el]);

This is actually oversimplified. It assumes the instruction vector_sum works. Although this is what happens with instruction intrinsics
Intrinsic function
In compiler theory, an intrinsic function is a function available for use in a given language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function...

, much information is actually not taken into account here such as the number of vector components and their data format. This is done for clarity.

You can see however, this method reduces the number of decoded instructions from numElements * componentsPerElement to numElements. The number of jump instructions is also decreased, as the loop is run fewer times. These gains result from the parallel execution of the four mathematical operations.

What happened however is that the packed SIMD register holds a certain amount of data so it's not possible to get more parallelism. The speed up is somewhat limited by the assumption we made of performing four parallel operations (please note this is common for both AltiVec
AltiVec
AltiVec is a floating point and integer SIMD instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola, , and implemented on versions of the PowerPC including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's...

 and SSE
Streaming SIMD Extensions
In computing, Streaming SIMD Extensions is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! . SSE contains 70 new instructions, most of which work on single precision floating point...

).

Parallel Stream paradigm (SIMD/MIMD)


// This is a fictional language for demonstration purposes.
elements = array streamElement([number,number])[100]
kernel = instance streamKernel("@arg0[@iter]")
result = kernel.invoke(elements)

As you can see, the idea is to define the whole set of data instead of each single block. Describing the set of data is assumed to be in the first two rows. After that, the result is inferred from the sources and kernel. For simplicity, there's a 1:1 mapping between input and output data but this does not need to be. Applied kernels can also be much more complex.

An implementation of this paradigm can "unroll" a loop internally. This allows throughput to scale with chip complexity, easily utilizing hundreds of ALUs. The elimination of complex data patterns makes much of this extra power available.

While stream processing is a branch of SIMD/MIMD processing, they must not be confused, although SIMD implementations can often work in a "streaming" manner, their performance is not comparable: the model envisions a much different usage pattern which allows far greater performance by itself.
It has been noted that when applied on generic processors such as standard CPU, only a 1.5x speedup can be reached. By contrast, ad-hoc stream processors easily reach over 10x performance, mainly attributed to the more efficient memory access and higher levels of parallel processing.

Although there are various degrees of flexibility allowed by the model, stream processors usually impose some limitations on the kernel or stream size. For example, consumer hardware often lacks the ability to perform high-precision math, lacks complex indirection chains or presents limits on the number of instructions which can be executed.

Stream processing considerations

Available documentation on Stream processing is very scarce as of this writing (September 12, 2005). Only a few specialized institutions seem to have understood the implied power of the model.
Stanford University
Stanford University
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...

 has been historically involved in a variety of projects on this, beginning from the Stanford Shading language and deploying a flexible, stand-alone stream processor called Imagine. Both those projects revealed the paradigm has a great potential so a much larger scale project has been started. With the name of Merrimac, a Stream-based supercomputer is now being researched.
AT&T
AT&T
AT&T Inc. is an American multinational telecommunications corporation headquartered in Whitacre Tower, Dallas, Texas, United States. It is the largest provider of mobile telephony and fixed telephony in the United States, and is also a provider of broadband and subscription television services...

 also recognized the wide adoption of stream-enhanced processors as GPU
Graphics processing unit
A graphics processing unit or GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display...

s rapidly evolved in both speed and functionality.

Data dependencies and parallelism

A great advantage of the stream programming model lies in the kernel defining independent and local data usage.

Kernel operations define the basic data unit, both as input and output. This allows the hardware to better allocate resources and schedule global I/O. Although usually not exposed in the programming model, the I/O operations seems to be much more advanced on stream processors (at least, on GPUs). I/O operations are also usually pipelined by themselves while chip structure can help hide latencies.
Definition of the data unit is usually explicit in the kernel, which is expected to have well-defined inputs (possibly using structures, which is encouraged) and outputs. In some environments, output values are fixed (in GPUs for example, there is a fixed set of output attributes, unless this is relaxed).
Having each computing block clearly independent and defined allows to schedule bulk read or write operations, greatly increasing cache and memory bus efficiency.

Data locality is also explicit in the kernel. This concept is usually referred to as kernel locality, identifying all the values which are short-lived to a single kernel invocation. All the temporaries are simply assumed to be local to each kernel invocation so, hardware or software can easily allocate them on fast registers. This is strictly related to degree of parallelism that can be exploited.

Inside each kernel, producer-consumer relationships can be individuated by usual means while, when kernels are chained one after the another, this relationship is given by the model.
This allows easier scheduling decisions because it's clear that if kernel B requires output from kernel A, it's obvious that A must be completed before B can be run (at least on the data unit being used).
The Imagine chip's on-board stream controller module manages kernel loads and execution in hardware at runtime keeping a scoreboard of kernel dependencies (as told by the compiler) and can allow out-of-order execution to minimize stalls producer-consumer locality. This is another major new paradigm for high performance processing. The Cell processor allows this by routing data between various SPEs for example. In comparison, since the Imagine is a pure SIMD machine, inter-cluster communication and kernel execution is always explicit with much lower silicon overhead than a MIMD machine, such as Cell. Imagine uses 8 clusters (a.k.a lanes) of ALUs (similar to Cell's SPEs), but the clusters run in data-parallel mode executing a single kernel at a time. Task switching is done using conventional time-multiplexing. There is only one instruction decode for instance. The tradeoff here is that for kernels that can exploit lower levels of data-parallelism, the efficiency drops as not all clusters will do useful work. For a vast majority of DSP processing though this trade off pays off very well.

The parallelism between two kernel instances is similar to a thread
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

 level parallelism. Each kernel instance gets data parallelism. Inside each kernel, it is still possible to use instruction level parallelism. Task parallelism (such as overlapped I/O) can still happen. It's easy to have thousands of kernel instances but it's simply impossible to have the same amounts of threads.

Programming model notes

The most immediate challenge in the realm of parallel processing does not lie as much in the type of hardware architecture used, but in how easy it will be to program the system in question in a real-world environment with acceptable performance. Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and DMA scheduling. This in itself is a result of the research at MIT and Stanford in finding an optimal layering of tasks between programmer, tools and hardware. Programmers beat tools in mapping algorithms to parallel hardware, and tools beat programmers in figuring out smartest memory allocation schemes, etc. Of particular concern are MIMD designs such as Cell, for which the programmer needs to deal with application partitioning across multiple cores and deal with process synchronization and load balancing. Efficient multi-core programming tools are severely lacking today.

A drawback of SIMD programming was the issue of Array-of-Structures (AoS) and Structure-of-Arrays (SoA). Programmers often wanted to build data structures with a 'real' meaning, for example:

// A particle in a three dimensional space.
struct particle_t {
float x, y, z; // not even an array!
unsigned byte color[3]; // 8 bit per channel, say we care about RGB only
float size;
// ... and many other attributes may follow...
};

What happened is that those structures were then assembled in arrays to keep things nicely organized. This is AoS.
When the structure is laid out in memory, the compiler will produce interleaved data, in the sense that all the structures will be contiguous but there will be a constant offset between, say, the "size" attribute of a structure instance and the same element of the following instance. The offset depends on the structure definition (and possibly other things not considered here such as compiler's policies).
There are also other problems. For example, the three position variables cannot be SIMD-ized that way, because it's not sure they will be allocated in continuous memory space. To make sure SIMD operations can work on them, they shall be grouped in a 'packed memory location' or at least in an array.
Another problem lies in both "color" and "xyz" to be defined in three-component vector quantities. SIMD processors usually have support for 4-component operations only (with some exceptions however).

These kinds of problems and limitations made SIMD acceleration on standard CPUs quite nasty.
The proposed solution, SoA follows as:

struct particle_t {
float *x, *y, *z;
unsigned byte *colorRed, *colorBlue, *colorGreen;
float *size;
};

For readers not experienced with C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, the '*' before each identifier means a pointer. In this case, they will be used to point to the first element of an array, which is to be allocated later. For Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

 programmers, this is roughly equivalent to "[]".
The drawback here is that the various attributes could be spread in memory. To make sure this does not cause cache misses, we'll have to update all the various "reds", then all the "greens" and "blues".
Although this is not so bad after all, it's simply overkill when compared to what most stream processors offer.

For stream processors, the usage of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility.
Taking GPUs as reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a stride between 'consecutive' elements of the same attributes, effectively allowing interleaved data.
When the GPU begins the stream processing, it will gather all the various attributes in a single set of parameters (usually this looks like a structure or a "magic global variable"), performs the operations and scatters the results to some memory area for later processing (or retrieving).

Summing up, there's more flexibility on the application's side yet everything looks very organized on the stream processor's side.

Models-of-Computation for Stream Processing

Apart from specifying streaming applications in high-level language. Models-of-Computation (MoCs) also have been widely used such as dataflow models and process-based models.

Generic processor architecture

Historically, CPUs began implementing various tiers of memory access optimizations because of the ever increasing performance when compared to relatively slow growing external memory bandwidth. As this gap widened, big amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes to those few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimation, consider it to be less than 10%).

A similar architecture exists on stream processors but thanks to the new programming model, the amount of transistors dedicated to management is actually very little.

Beginning from a whole system point of view, stream processors usually exist in a controlled environment. GPUs do exist on an add-in board (this seems to also apply to Imagine). CPUs do the dirty job of managing system resources, running applications and such.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common, multi-buses have been employed in the past). The exact amount of memory lanes is dependent on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (actually up to 512MB) with a slightly slower crossbar that is 256 bits wide. By contrast, standard processors from Intel Pentium to some Athlon 64
Athlon 64
The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...

 have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The thing which most closely matches a multiple pointer indirection is an indirection chain, which is however guaranteed to finally read or write from a specific memory area (inside a stream).

Because of the SIMD nature of the stream processor's execution units (ALUs clusters), read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus
Rambus
Rambus Incorporated , founded in 1990, is a technology licensing company. The company became well known for its intellectual property based litigation following the introduction of DDR-SDRAM memory.- History :...

 and DDR SDRAM
DDR SDRAM
Double data rate synchronous dynamic random access memory is a class of memory integrated circuits used in computers. DDR SDRAM has been superseded by DDR2 SDRAM and DDR3 SDRAM, neither of which are either forward or backward compatible with DDR SDRAM, meaning that DDR2 or DDR3 memory modules...

, for example). This also allows for efficient memory bus negotiations.

Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays.

Internally, a stream processor features some clever communication and management circuits but what's interesting is the Stream Register File (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulks. As a cache-like software-controlled structure to the various ALU
Arithmetic logic unit
In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s, the SRF is shared between all the various ALU clusters. The key concept and innovation here done with Stanford's Imagine chip is that the compiler is able to automate and allocate memory in an optimal way, fully transparent to the programmer. The dependencies between kernel functions and data is known through the programming model which enables the compiler to perform flow analysis and optimally pack the SRFs. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler did an as well or better job at scheduling memory than if you hand tuned the thing with much effort.

There is proof, there can be only a lot of clusters because inter-cluster communication is assumed to be rare. Internally however, each cluster can efficiently exploit a much lower amount of ALUs because inter-cluster communication is common and thus needs to be highly efficient.

To keep those ALUs fetched with data, each ALU is equipped with Local Register Files (LRFs), which are basically its usable registers.

This three-tiered data access pattern, makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving.

Hardware-in-the-loop issues


Although an order of magnitude speedup can be reasonably expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this.
Communication latencies are actually the biggest problem. Although PCI Express
PCI Express
PCI Express , officially abbreviated as PCIe, is a computer expansion card standard designed to replace the older PCI, PCI-X, and AGP bus standards...

 improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work will possibly take long amounts of time. This means it's usually counter-productive to use them for small datasets. Because changing the kernel is a rather expensive operation the stream architecture also incurs penalties for small streams, a behaviour referred to as the short stream effect.

Pipelining
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

 is a very widespread and heavily-used practice on stream processors, with GPUs featuring pipelines exceeding 200 stages. The cost for switching settings is dependent on the setting being modified but it's now considered to always be expensive. To avoid those problems at various levels of the pipeline, many techniques have been deployed such as "über shaders" and "texture atlases". Those techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.

Notable Stream Processors

  • Imagine, headed by Professor William Dally of Stanford University
    Stanford University
    The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...

    , is a flexible architecture intended to be both fast and energy efficient. The project, originally conceived in 1996, included architecture, software tools, a VLSI implementation and a development board, was funded by DARPA, Intel and Texas Instruments
    Texas Instruments
    Texas Instruments Inc. , widely known as TI, is an American company based in Dallas, Texas, United States, which develops and commercializes semiconductor and computer technology...

    .
  • Another Stanford project called Merrimac is aimed at developing a stream-based supercomputer. Merrimac intends to use a stream architecture and advanced interconnection networks to provide more performance per unit cost than cluster-based scientific computers built from the same technology.
  • The Storm-1 Family from Stream Processors, Inc
    Stream Processors, Inc
    Stream Processors, Inc was a Silicon Valley-based fabless semiconductor companyspecializing in the design and manufacture of high-performance digital signal processors for applications including video...

    , a commercial spinoff of Stanford's Imagine project, was announced during a feature presentation at ISSCC 2007. The family contains four members ranging from 30 GOPS to 220 16-bit GOPS (billions of operations per second), all fabricated at TSMC
    TSMC
    Taiwan Semiconductor Manufacturing Company, Limited or TSMC is the world's largest dedicated independent semiconductor foundry, with its headquarters and main operations located in the Hsinchu Science Park in Hsinchu, Taiwan.-Overview:...

     in a 130 nanometer process. The devices target the high end of the DSP
    Digital signal processor
    A digital signal processor is a specialized microprocessor with an architecture optimized for the fast operational needs of digital signal processing.-Typical characteristics:...

     market including video conferencing, multifunction printer
    Multifunction printer
    An MFP , multifunctional, all-in-one , or Multifunction Device , is an office machine which incorporates the functionality of multiple devices in one, so as to have a smaller footprint in a home or small business setting , or to provide centralized document...

    s and digital video surveillance equipment.
  • GPUs are widespread, consumer-grade stream processors designed mainly by AMD and Nvidia
    NVIDIA
    Nvidia is an American global technology company based in Santa Clara, California. Nvidia is best known for its graphics processors . Nvidia and chief rival AMD Graphics Techonologies have dominated the high performance GPU market, pushing other manufacturers to smaller, niche roles...

    . Various generations to be noted from a stream processing point of view:
    • Pre-R2xx/NV2x: no explicit support for stream processing. Kernel operations were hidden in the API and provided too little flexibility for general use.
    • R2xx/NV2x: kernel stream operations became explicitly under the programmer's control but only for vertex processing (fragments were still using old paradigms). No branching support severely hampered flexibility but some types of algorithms could be run (notably, low-precision fluid simulation).
    • R3xx/NV4x: flexible branching support although some limitations still exist on the number of operations to be executed and strict recursion depth, as well as array manipulation.
    • R8xx: Supports append/consume buffers and atomic operations. This generation is the state of the art.
  • The Cell processor from STI, an alliance of Sony Computer Entertainment
    Sony Computer Entertainment
    Sony Computer Entertainment, Inc. is a major video game company specializing in a variety of areas in the video game industry, and is a wholly owned subsidiary and part of the Consumer Products & Services Group of Sony...

    , Toshiba Corporation, and IBM
    IBM
    International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

    , is a hardware architecture that can function like a stream processor with appropriate software support. It consists of a controlling processor, the PPE (Power Processing Element, an IBM PowerPC
    PowerPC
    PowerPC is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM...

    ) and a set of SIMD coprocessors, called SPEs (Synergistic Processing Elements), each with independent program counters and instruction memory, in effect a MIMD
    MIMD
    In computing, MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data...

     machine. In the native programming model all DMA and program scheduling is left up to the programmer. The hardware provides a fast ring bus among the processors for local communication. Because the local memory for instructions and data is limited the only programs that can exploit this architecture effectively either require a tiny memory footprint or adhere to a stream programming model. With a suitable algorithm the performance of the Cell can rival that of pure stream processors, however this nearly always requires a complete redesign of algorithms and software.

Stream Programming Languages

Most programming languages for stream processors start with Java, C or C++ and add extensions which provide specific instructions to allow application developers to tag kernels and/or streams. This also applies to most shading language
Shading language
A shading language is a special programming language adapted to map on shader programming. Those kind of languages usually have special data types like color and normal...

s, which can be considered stream programming languages to a certain degree.

Non-commercial examples of stream programming languages include:
  • Ateji PX
    Ateji PX
    Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud....

     Free Edition, enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm on JVM
  • Auto-Pipe, application development environment for streaming applications allows authoring of applications for heterogeneous systems (CPU, GPGPU, FPGA). Applications can be developed in any combination of C, C++, and Java for the CPU. Verilog or VHDL for FPGAs. Cuda is currently used for Nvidia GPGPUs. Auto-Pipe also handles coordination of TCP connections between multiple machines.
  • ACOTES Programming Model: language from Polytechnic University of Catalonia based on OpenMP
    OpenMP
    OpenMP is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems, including Linux, Unix, AIX, Solaris, Mac OS X, and Microsoft Windows platforms...

  • Brook language from Stanford
  • DUP language from Technical University of Munich
    Technical University of Munich
    The Technische Universität München is a research university with campuses in Munich, Garching, and Weihenstephan...

     and University of Denver
    University of Denver
    The University of Denver is currently ranked 82nd among all public and private "National Universities" by U.S. News & World Report in the 2012 rankings....

  • OpenCL
    OpenCL
    OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language for writing kernels , plus APIs that are used to define and then control the platforms...

    , an open standard which currently runs on most hardware, including X86-CPUs, X86-GPUs, ARM-GPUs and FPGAs.
  • Sh
    Lib Sh
    Sh is a metaprogramming language for programmable GPUs. Programmable GPUs are graphics processing units that execute some operations with higher efficiency than CPUs...

     library from the University of Waterloo
    University of Waterloo
    The University of Waterloo is a comprehensive public university in the city of Waterloo, Ontario, Canada. The school was founded in 1957 by Drs. Gerry Hagey and Ira G. Needles, and has since grown to an institution of more than 30,000 students, faculty, and staff...

  • Shallows, an open source project
  • S-Net coordination language from the University of Hertfordshire
    University of Hertfordshire
    The University of Hertfordshire is a new university based largely in Hatfield, in the county of Hertfordshire, England, from which the university takes its name. It has more than 27,500 students, over 2500 staff, with a turnover of over £181m...

    , which provides separation of coordination and algorithmic programming
  • StreamIt from MIT
  • WaveScript Functional stream processing, also from MIT.
  • Functional reactive programming
    Functional reactive programming
    Functional reactive programming is a programming paradigm for reactive programming using the building blocks of functional programmingThe key points of FRP are:* Input is viewed as a "behavior", or time-varying stream of events...

     could be considered stream processing in a broad sense.


Commercial implementations are either general purpose or tied to specific hardware by a vendor. Examples of general purpose languages include:
  • AccelerEyes
    AccelerEyes
    AccelerEyes builds programming tools for parallel programming and visual computing on GPU chipsets. Based in Atlanta, the company released Jacket, a tool to compile MATLAB code for CUDA-enabled GPUs, in June 2008...

    ' Jacket, a commercialization of a GPU engine for MATLAB
  • Ateji PX
    Ateji PX
    Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud....

     Java extension that enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm
  • Floodgate, a stream processor provided with the Gamebryo
    Gamebryo
    Gamebryo is a game engine, originally from Numerical Design Limited , and the successor to NDL's NetImmerse engine.Since the creation of Gamebryo, NDL merged into Emergent Game Technologies...

     game engine for PlayStation3, Xbox360, Wii, and PC
  • HMPP Open Standard
    HMPP Open Standard
    HMPP for Hybrid Multicore Parallel Programming.Based on a set of directives, OpenHMPP Standard is a programming model designed to handle hardware accelerators without the complexity associated with GPU programming...

    , a "directive" vision of Many-Core programming
  • PeakStream, a spinout of the Brook
    BrookGPU
    BrookGPU is the Stanford University graphics group's compiler and runtime implementation of the Brook stream programming language for using modern graphics hardware for non-graphical, general purpose computations...

     project (acquired by Google in June 2007)
  • IBM Spade - Stream Processing Application Declarative Engine (B. Gedik, et al. SPADE: the system S declarative stream processing engine. ACM SIGMOD 2008.)
  • RapidMind
    RapidMind
    RapidMind Inc. was a privately held company founded and headquartered in Waterloo, Ontario, Canada, acquired by Intel in 2009. It provided a software product that aims to make it simpler for software developers to target multi-core processors and accelerators such as GPUs.-History:RapidMind was...

    , a commercialization of Sh
    Lib Sh
    Sh is a metaprogramming language for programmable GPUs. Programmable GPUs are graphics processing units that execute some operations with higher efficiency than CPUs...

     (acquired by Intel in August 2009)
  • TStreams, Hewlett-Packard Cambridge Research Lab


Vendor-specific languages include:
  • Brook+ (AMD hardware optimized implementation of Brook
    BrookGPU
    BrookGPU is the Stanford University graphics group's compiler and runtime implementation of the Brook stream programming language for using modern graphics hardware for non-graphical, general purpose computations...

    ) from AMD/ATI
    ATI Technologies
    ATI Technologies Inc. was a semiconductor technology corporation based in Markham, Ontario, Canada, that specialized in the development of graphics processing units and chipsets. Founded in 1985 as Array Technologies Inc., the company was listed publicly in 1993 and was acquired by Advanced Micro...

  • CUDA
    CUDA
    CUDA or Compute Unified Device Architecture is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia graphics processing units that is accessible to software developers through variants of industry standard programming languages...

     (Compute Unified Device Architecture) from Nvidia
    NVIDIA
    Nvidia is an American global technology company based in Santa Clara, California. Nvidia is best known for its graphics processors . Nvidia and chief rival AMD Graphics Techonologies have dominated the high performance GPU market, pushing other manufacturers to smaller, niche roles...

  • Intel Ct
    Intel Ct
    Intel Ct is a programming model developed by Intel to ease the exploitation of its future multicore chips, as demonstrated by the Tera-Scale research program.It is based on the exploitation of SIMD to produce automatically parallelized programs....

     - C for Throughput Computing
  • StreamC from Stream Processors, Inc
    Stream Processors, Inc
    Stream Processors, Inc was a Silicon Valley-based fabless semiconductor companyspecializing in the design and manufacture of high-performance digital signal processors for applications including video...

    , a commercialization of the Imagine work at Stanford

See also

  • GPGPU
    GPGPU
    General-purpose computing on graphics processing units is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU...

  • MIMD
    MIMD
    In computing, MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data...

  • Parallel computing
    Parallel computing
    Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently . There are several different forms of parallel computing: bit-level,...

  • Molecular modeling on GPU
    Molecular modeling on GPU
    Molecular modeling on GPU is the technique of using a graphics processing unit for molecular simulations.In 2007, NVIDIA introduced video cards that could be used not only to show graphics but also for scientific calculations. These cards include many arithmetic units working in parallel...

  • SIMD
    SIMD
    Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

  • Vector processor
    Vector processor
    A vector processor, or array processor, is a central processing unit that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items...

  • Dataflow
    Dataflow
    Dataflow is a term used in computing, and may have various shades of meaning. It is closely related to message passing.-Software architecture:...

  • Partitioned global address space
    Partitioned global address space
    In computer science, a partitioned global address space is a parallel programming model. It assumes a global memory address space that is logically partitioned and a portion of it is local to each processor. The novelty of PGAS is that the portions of the shared memory space may have an affinity...

  • Streaming algorithm
    Streaming algorithm
    In computer science, streaming algorithms are algorithms forprocessing data streams in which the input is presented as a sequence ofitems and can be examined in only a few passes...

  • Data stream mining
    Data stream mining
    Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records.A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage...

  • Dimension reduction
  • Digital signal processing
    Digital signal processing
    Digital signal processing is concerned with the representation of discrete time signals by a sequence of numbers or symbols and the processing of these signals. Digital signal processing and analog signal processing are subfields of signal processing...

  • Real Time Streaming Protocol
    Real Time Streaming Protocol
    The Real Time Streaming Protocol is a network control protocol designed for use in entertainment and communications systems to control streaming media servers. The protocol is used for establishing and controlling media sessions between end points...


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK