SSE2
Encyclopedia
SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

 (Single Instruction, Multiple Data) processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4
Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...

 in 2001. It extends the earlier SSE
Streaming SIMD Extensions
In computing, Streaming SIMD Extensions is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! . SSE contains 70 new instructions, most of which work on single precision floating point...

 instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions , is the third iteration of the SSE instruction set for the IA-32 architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU...

 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Rival chip-maker AMD added support for SSE2 with the introduction of their Opteron
Opteron
Opteron is AMD's x86 server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture . It was released on April 22, 2003 with the SledgeHammer core and was intended to compete in the server and workstation markets, particularly in the same...

 and Athlon 64
Athlon 64
The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...

 ranges of AMD64
X86-64
x86-64 is an extension of the x86 instruction set. It supports vastly larger virtual and physical address spaces than are possible on x86, thereby allowing programmers to conveniently work with much larger data sets. x86-64 also provides 64-bit general purpose registers and numerous other...

 64-bit CPUs in 2003.

Changes

SSE2 extends MMX instructions to operate on XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers "aliased" on the original IA-32 floating point register stack. This permits mixing integer SIMD and scalar floating point operations without the mode switching required between MMX and x87
X87
x87 is a floating point-related subset of the x86 architecture instruction set. It originated as an extension of the 8086 instruction set in the form of optional floating point coprocessors that worked in tandem with corresponding x86 CPUs. These microchips had names ending in "87"...

 floating point operations. However, this is over-shadowed by the value of being able to perform MMX operations on the wider SSE registers.

Other SSE2 extensions include a set of cache
CPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...

-control instructions intended primarily to minimize cache pollution
Cache pollution
Cache pollution describes situations where an executing computer program loads data into CPU cache unnecessarily, thus causing other needed data to be evicted from the cache into lower levels of the memory hierarchy, potentially all the way down to main memory, thus causing a performance...

 when processing indefinite streams of information, and a sophisticated complement of numeric format conversion instructions.

AMD's implementation of SSE2 on the AMD64 (x86-64
X86-64
x86-64 is an extension of the x86 instruction set. It supports vastly larger virtual and physical address spaces than are possible on x86, thereby allowing programmers to conveniently work with much larger data sets. x86-64 also provides 64-bit general purpose registers and numerous other...

) platform includes an additional eight registers, doubling the total number to 16 (XMM0 through XMM15). These additional registers are only visible when running in 64-bit mode. Intel adopted these additional registers as part of their support for x86-64 architecture (or in Intel's parlance, "Intel 64") in 2004.

Differences between x87 FPU and SSE2

The FPU (x87) instructions usually store intermediate results with 80 bits of precision. When legacy FPU software algorithms are ported to SSE2, certain combinations of math operations or input datasets can result in measurable numerical deviation: This is of critical importance to scientific computations, if the calculation results must be compared against results generated from a different machine architecture.

Depending on the compiler or interpreter (and optimizations) used, different intermediate results of a given mathematical expression or iterative algorithm may need to be temporarily saved, and later reloaded. SSE2 works with either 32 or 64 bits (4 or 8 bytes) of precision while x87 instructions normally produces 80-bit results in its 80-bit registers (10 bytes). All the 80 bits of an x87 result may be stored in memory, but is nevertheless often rounded to 64 or 32 bits for compatibility with the most common floating point data types. Depending on precision as well as when such roundings are performed, the numerical results may be different. Similar differences can be seen when comparing results from 32 or 64-bit precision SSE2 code with corresponding results of 32, 64, or 80-bit precision x87 code. The following Fortran code compiled with G95
G95
G95 is a free, portable, open source Fortran 95 compiler. It implements the Fortran 95 standard, part of the Fortran 2003 standard and some old and new extensions including proposed features for the Fortran 2008 standard like Co-array Fortran....

 is offered as an example; the exact value of the third and final number printed is zero.

program hi
real a,b,c,d
real x,y,z
a=.013
b=.027
c=.0937
d=.79
y=-a/b + (a/b+c)*EXP(d)
print *,y
z=(-a)/b + (a/b+c)*EXP(d)
print *,z
x=y-z
print *,x
end

Compiling to 387 floating point instructions and running yields:
# g95 -o hi -mfpmath=387 -fzero -ftrace=full -fsloppy-char hi.for
# ./hi
0.78587145
0.7858714
5.9604645E-8

Compiling to SSE2 instructions and running yields:
# g95 -o hi -mfpmath=sse -msse2 -fzero -ftrace=full -fsloppy-char hi.for
# ./hi
0.78587145
0.78587145
0.

Differences between MMX and SSE2

SSE2 extends MMX instructions to operate on XMM registers. Therefore, it is possible to convert all existing MMX code to SSE2 equivalent. Since an XMM register is twice as long as an MMX register, loop counters and memory access may need to be changed to accommodate this.

Although one SSE2 instruction can operate on twice as much data as an MMX instruction, performance might not increase significantly. Two major reasons are: accessing SSE2 data in memory not aligned
Data structure alignment
Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word sized chunks...

 to a 16-byte boundary will incur significant penalty, and the throughput
Throughput
In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node...

 of SSE2 instructions in most x86 implementations is usually smaller than MMX instructions. Intel has recently addressed the first problem by adding an instruction in SSE3
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions , is the third iteration of the SSE instruction set for the IA-32 architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU...

 to reduce the overhead of accessing unaligned data, and the last problem by widening the execution engine in their Core microarchitecture.

Compiler usage

When first introduced in 2000, SSE2 was not supported by software development tools. For example, to use SSE2 in a Microsoft Developer Studio
Microsoft Visual Studio
Microsoft Visual Studio is an integrated development environment from Microsoft. It is used to develop console and graphical user interface applications along with Windows Forms applications, web sites, web applications, and web services in both native code together with managed code for all...

 project, the programmer had to either manually write inline-assembly or import object-code from an external source. Later the Visual C++ Processor Pack added SSE2 support to Visual C++
Visual C++
Microsoft Visual C++ is a commercial , integrated development environment product from Microsoft for the C, C++, and C++/CLI programming languages...

 and MASM.

The Intel C++ Compiler
Intel C++ Compiler
Intel C++ Compiler is a group of C and C++ compilers from Intel Corporation available for GNU/Linux, Mac OS X, and Microsoft Windows....

 can automatically generate SSE4/SSSE3/SSE3/SSE2 and/or SSE-code without the use of hand-coded assembly.

Since GCC 3, GCC
GNU Compiler Collection
The GNU Compiler Collection is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain...

 can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4.

The Sun Studio Compiler Suite can also generate SSE2 instructions when the compiler flag -xvector=simd is used.

CPUs supporting SSE2

  • AMD K8
    AMD K8
    The AMD K8 is a computer processor microarchitecture designed by AMD as the successor to the AMD K7 microarchitecture. The K8 was the first implementation of the AMD64 64-bit extension to the x86 processor architecture.Processors based on the K8 core include:...

    -based CPUs (Athlon 64
    Athlon 64
    The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...

    , Sempron 64
    Sempron
    Sempron has been the marketing name used by AMD for several different budget desktop CPUs, using several different technologies and CPU socket formats. The Sempron replaced the AMD Duron processor and competes against Intel's Celeron series of processors...

    , Turion 64, etc.)
  • AMD Phenom
    Phenom (processor)
    Phenom is the 64-bit AMD desktop processor line based on the K10 microarchitecture, in what AMD calls family 10h processors, sometimes incorrectly called "K10h". Triple-core versions belong to the Phenom 8000 series and quad cores to the AMD Phenom X4 9000 series...

     CPUs
  • Intel NetBurst-based CPUs (Pentium 4
    Pentium 4
    Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...

    , Xeon
    Xeon
    The Xeon is a brand of multiprocessing- or multi-socket-capable x86 microprocessors from Intel Corporation targeted at the non-consumer server, workstation and embedded system markets.-Overview:...

    , Celeron
    Celeron
    Celeron is a brand name given by Intel Corp. to a number of different x86 computer microprocessor models targeted at budget personal computers....

    , Celeron D, etc.)
  • Intel Pentium M
    Pentium M
    The Pentium M brand refers to a family of mobile single-core x86 microprocessors introduced in March 2003 , and forming a part of the Intel Carmel notebook platform under the then new Centrino brand...

     and Celeron M
  • Intel Core
    Intel Core
    Yonah was the code name for Intel's first generation of 65 nm process mobile microprocessors, based on the Banias/Dothan-core Pentium M microarchitecture. SIMD performance has been improved through the addition of SSE3 instructions and improvements to SSE and SSE2 implementations, while integer...

     family (including Intel Core 2
    Intel Core 2
    Core 2 is a brand encompassing a range of Intel's consumer 64-bit x86-64 single-, dual-, and quad-core microprocessors based on the Core microarchitecture. The single- and dual-core models are single-die, whereas the quad-core models comprise two dies, each containing two cores, packaged in a...

    , Intel Core i5, Intel Core i7)
  • Intel Atom
    Intel Atom
    Intel Atom is the brand name for a line of ultra-low-voltage x86 and x86-64 CPUs from Intel, designed in 45 nm CMOS and used mainly in netbooks, nettops, embedded application ranging from health care to advanced robotics and Mobile Internet devices...

  • Transmeta
    Transmeta
    Transmeta Corporation was a US-based corporation that licensed low power semiconductor intellectual property. Transmeta originally produced very long instruction word code morphing microprocessors, with a focus on reducing power consumption in electronic devices. It was founded in 1995 by Bob...

     Efficeon
    Efficeon
    The Efficeon processor is Transmeta's second-generation 256-bit VLIW design which employs a software engine to convert code written for x86 processors to the native instruction set of the chip...

  • VIA
    VIA Technologies
    VIA Technologies is a Taiwanese manufacturer of integrated circuits, mainly motherboard chipsets, CPUs, and memory, and is part of the Formosa Plastics Group. It is the world's largest independent manufacturer of motherboard chipsets...

     C7
    VIA C7
    The VIA C7 is an x86 central processing unit designed by Centaur Technology and sold by VIA Technologies.- Product history :The C7 delivers a number of improvements to the older VIA C3 cores but is nearly identical to the latest VIA C3 Nehemiah core. The C7 was officially launched in May 2005,...

  • VIA
    VIA Technologies
    VIA Technologies is a Taiwanese manufacturer of integrated circuits, mainly motherboard chipsets, CPUs, and memory, and is part of the Formosa Plastics Group. It is the world's largest independent manufacturer of motherboard chipsets...

     Nano
    VIA Nano
    The VIA Nano is a 64-bit CPU for personal computers. The VIA Nano was released by VIA Technologies in 2008 after five years of development by its CPU division, Centaur Technology...


Notable IA-32 CPUs not supporting SSE2

SSE2 is an extension of the IA-32
IA-32
IA-32 , also known as x86-32, i386 or x86, is the CISC instruction-set architecture of Intel's most commercially successful microprocessors, and was first implemented in the Intel 80386 as a 32-bit extension of x86 architecture...

 architecture. Therefore any architecture that does not support IA-32 does not support SSE2. x86-64
X86-64
x86-64 is an extension of the x86 instruction set. It supports vastly larger virtual and physical address spaces than are possible on x86, thereby allowing programmers to conveniently work with much larger data sets. x86-64 also provides 64-bit general purpose registers and numerous other...

 CPUs all implement IA-32
IA-32
IA-32 , also known as x86-32, i386 or x86, is the CISC instruction-set architecture of Intel's most commercially successful microprocessors, and was first implemented in the Intel 80386 as a 32-bit extension of x86 architecture...

. All known x86-64
X86-64
x86-64 is an extension of the x86 instruction set. It supports vastly larger virtual and physical address spaces than are possible on x86, thereby allowing programmers to conveniently work with much larger data sets. x86-64 also provides 64-bit general purpose registers and numerous other...

 CPUs also implement SSE2. Since IA-32 predates SSE2, early IA-32 CPUs did not implement it. SSE2 and the other SIMD instruction sets were intended primarily to improve CPU support for realtime graphics, notably gaming. A CPU that is not marketed for this purpose or that has an alternative SIMD instruction set has no need for SSE2.

The following CPUs implemented IA-32 after SSE2 was developed, but did not implement SSE2:
  • AMD CPUs prior to Athlon 64
    Athlon 64
    The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...

    , including all Socket A
    Socket A
    Socket A is the CPU socket used for AMD processors ranging from the Athlon Thunderbird to the Athlon XP/MP 3200+, and AMD budget processors including the Duron and Sempron. Socket A also supports AMD Geode NX embedded processors...

    -based CPUs
  • Intel CPUs prior to Pentium 4
    Pentium 4
    Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...

  • VIA C3
    VIA C3
    The VIA C3 is a family of x86 central processing units for personal computers designed by Centaur Technology and sold by VIA Technologies. The different CPU cores are built following the design methodology of Centaur Technology.-Samuel 2 and Ezra cores:...

  • Transmeta Crusoe
    Transmeta Crusoe
    The Crusoe is a family of x86-compatible microprocessors developed by Transmeta. Crusoe was notable for its method of achieving x86 compatibility. Instead of the instruction set architecture being implemented in hardware, or translated by specialized hardware, the Crusoe runs a software abstraction...


See also

  • SSE2 instructions
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK