Cell is a microprocessor architecture jointly developed by Sony Computer Entertainment, Toshiba, and IBM, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas over a four-year period beginning March 2001 on a budget reported by Sony as approaching US$400 million. Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's PlayStation 3 game console. Mercury Computer Systems has a dual Cell server, a dual Cell blade configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba has announced plans to incorporate Cell in high definition television sets. Exotic features such as the XDR memory subsystem and the coherent Element Interconnect Bus (EIB) appear to position Cell for future applications in the supercomputing space, exploiting the Cell processor's prowess in floating point kernels. IBM has announced plans to incorporate Cell processors as add-on cards into IBM System z9 mainframes, to enable them to be used as servers for MMORPGs.
The Cell architecture includes a novel memory coherence architecture for which IBM received many patents. The architecture emphasizes performance per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. IBM provides a comprehensive Linux-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.
In November 2006, the College of Computing at Georgia Tech was selected by IBM, Sony, and Toshiba from more than a dozen universities to be designated as the first STI Center of Competence for the Cell Processor. This partnership is designed to build a community of programmers and broaden industry support for the Cell processor. The center also offers a Cell programming tutorial video.
History
In mid-2000, Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance known as "STI" to design and manufacture the processor.
The STI Design Center opened in March 2001. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.
During this period, IBM filed many patents pertaining to the Cell architecture, manufacturing process, and software environment. An early patent version of the Broadband Engine was shown to be a chip package comprising four "Processing Elements", which was the patent's description for what is now known as the Power Processing Element. Each Processing Element contained 8 APUs, which are now referred to as SPEs on the current Broadband Engine chip. This chip package was widely believed to run at a clock speed of 4 GHz; with 32 APUs (four Processing Elements of 8 APUs each) providing 32 GFLOPS each, the Broadband Engine was shown to have 1 teraflop of raw computing power. This design was fabricated using a 90 nm SOI process.
In March 2007, IBM announced that the 65 nm version of Cell BE was in production at its plant in East Fishkill, New York.
In February 2008, IBM announced that it would begin to fabricate Cell processors with the 45 nm process.
In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor, the PowerXCell 8i, at the 65 nm feature size.
In May 2008, an Opteron- and Cell-BE-based supercomputer, the IBM Roadrunner system, became the world's first system to achieve one petaFLOPS. The Cell BE-based Roadrunner system is currently the world's fastest supercomputer as represented by the Top500 list. The world's three most energy-efficient supercomputers, as represented by the Green500 list, are similarly based on the PowerXCell 8i.
The 45 nm Cell processor was introduced in concert with Sony's PlayStation 3 Slim in August 2009.
Commercialization
On May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the forthcoming PlayStation 3 console. This Cell configuration has one Power Processing Element (PPE) on the core, with eight physical SPEs in silicon. In the PlayStation 3, one SPE is locked out during the test process, a practice which helps to improve manufacturing yields, and another one is reserved for the OS, leaving six SPEs free for use by game code. The target clock frequency at introduction was 3.2 GHz. The introductory design was fabricated using a 90-nanometer SOI process, with initial volume production slated for IBM's facility in East Fishkill, New York.
The relationship between cores and threads is a common source of confusion. The PPE core is dual threaded and manifests in software as two independent threads of execution, while each active SPE manifests as a single thread. In the PlayStation 3 configuration as described by Sony, the Cell processor provides nine independent threads of execution.
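The SPE and thread counts for the PlayStation 3 configuration follow from simple bookkeeping. The sketch below only restates the figures given in the text; the constant and function names are illustrative and do not come from any Sony or IBM interface.

```python
# Figures taken from the description above; all names are illustrative.
PHYSICAL_SPES = 8          # SPEs present in silicon
DISABLED_FOR_YIELD = 1     # locked out during the test process
RESERVED_FOR_OS = 1        # held back for the console's operating system
PPE_HARDWARE_THREADS = 2   # the PPE is dual threaded


def active_spes():
    """SPEs that are switched on in a PlayStation 3 Cell."""
    return PHYSICAL_SPES - DISABLED_FOR_YIELD


def game_spes():
    """Active SPEs left over for game code once the OS takes its share."""
    return active_spes() - RESERVED_FOR_OS


def hardware_threads():
    """Two PPE threads plus one thread per active SPE."""
    return PPE_HARDWARE_THREADS + active_spes()


print(active_spes(), game_spes(), hardware_threads())  # 7 6 9
```

The nine threads therefore count the OS-reserved SPE as well; only six of the seven SPE threads are available to game code.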
On June 28, 2005, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications. Mercury has since released blades, conventional rack servers and PCI Express accelerator boards with Cell processors.
In the fall of 2006, IBM released the QS20 blade module, which uses two Cell BE processors and reaches a peak of 410 gigaFLOPS per module. The QS22, based on the PowerXCell 8i processor, is used for the IBM Roadrunner supercomputer. Mercury and IBM use the fully utilized Cell processor with 8 active SPEs. On April 8, 2008, Fixstars Corporation released a PCI Express accelerator board based on the PowerXCell 8i processor.
Sony's high performance media computing server ZEGO uses a 3.2 GHz Cell/B.E. processor.
Overview
The Cell Broadband Engine, or Cell as it is more commonly known, is a microprocessor designed to bridge the gap between conventional desktop processors (such as the Athlon 64 and Core 2 families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics processors (GPUs). The longer name indicates its intended use, namely as a component in current and future digital distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as computer entertainment systems for the HDTV era. Additionally, the processor may be suited to digital imaging systems (medical, scientific, etc.) as well as physical simulation (e.g., scientific and structural engineering modeling).
In a simple analysis, the Cell processor can be split into four components: external input and output structures, the main processor called the Power Processing Element (PPE) (a two-way simultaneous multithreaded Power ISA v.2.03 compliant core), eight fully functional co-processors called the Synergistic Processing Elements, or SPEs, and a specialized high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs, called the Element Interconnect Bus or EIB.
To achieve the high performance needed for mathematically intensive tasks, such as decoding/encoding MPEG streams, generating or transforming three-dimensional data, or undertaking Fourier analysis of data, the Cell processor marries the SPEs and the PPE via the EIB to give access, via fully cache coherent DMA (direct memory access), to both main memory and other external data storage. To make the best of the EIB, and to overlap computation and data transfer, each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine. Since an SPE's load/store instructions can only access its own local memory, each SPE depends entirely on DMAs to transfer data to and from the main memory and other SPEs' local memories. A DMA operation can transfer either a single block area of size up to 16 KB, or a list of 2 to 2048 such blocks. One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer, with a view to enabling maximal asynchrony and concurrency in data processing inside a chip.
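The block and list limits quoted above determine how a large transfer has to be broken up. The following is a minimal planning sketch, assuming that "16 KB" means 16 × 1024 bytes and ignoring the alignment rules that real hardware imposes; `plan_transfer` is a hypothetical helper, not part of any Cell SDK.

```python
MAX_BLOCK_BYTES = 16 * 1024   # "up to 16 KB" per block (assumed binary kilobytes)
MAX_LIST_BLOCKS = 2048        # a DMA list holds 2 to 2048 blocks


def plan_transfer(total_bytes):
    """Split a transfer into DMA operations (hypothetical helper).

    Returns a list of operations, each of which is a list of block sizes.
    An operation with one block models a single-block DMA; an operation
    with 2 to 2048 blocks models a DMA list.
    """
    if total_bytes <= 0:
        return []
    # Cut the request into blocks no larger than the per-block limit.
    blocks = []
    remaining = total_bytes
    while remaining > 0:
        size = min(remaining, MAX_BLOCK_BYTES)
        blocks.append(size)
        remaining -= size
    # Group the blocks into operations no longer than the list limit.
    return [blocks[i:i + MAX_LIST_BLOCKS]
            for i in range(0, len(blocks), MAX_LIST_BLOCKS)]


ops = plan_transfer(40 * 1024)
print([len(op) for op in ops], ops[0])  # [3] [16384, 16384, 8192]
```

Under these assumptions a single DMA list covers at most 2048 × 16 KB = 32 MB, so anything larger needs more than one operation.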
The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. To this end the PPE has additional instructions relating to control of the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. Despite having Turing-complete architectures, the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. Though most of the "horsepower" of the system comes from the synergistic processing elements, the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge to software developers who wish to make the most of this horsepower, demanding careful hand-tuning of programs to extract maximal performance from this CPU.
The PPE and bus architecture includes various modes of operation giving different levels of memory protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.
Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit AltiVec register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 128 bits in size or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values for a theoretical address range of 2^64 bytes (16 exabytes or 16,777,216 terabytes). In practice, not all of these bits are implemented in hardware. Local store addresses internal to the SPU are expressed as a 32-bit word. In documentation relating to Cell, a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
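The address-range figures and register widths above can be checked with a few lines of arithmetic. This sketch assumes the binary interpretation of the units (1 terabyte = 2^40 bytes, 1 exabyte = 2^60 bytes), which is the one that reproduces the numbers in the text; the names are illustrative.

```python
ADDRESS_BITS = 64
REGISTER_BITS = 128

TERABYTE = 2 ** 40   # binary units assumed
EXABYTE = 2 ** 60

address_space = 2 ** ADDRESS_BITS          # bytes addressable with 64 bits
print(address_space // EXABYTE)            # 16
print(address_space // TERABYTE)           # 16777216

# Cell documentation's size vocabulary, in bits.
SIZES = {"word": 32, "doubleword": 64, "quadword": 128}


def lanes(element_bits):
    """Number of elements of the given width that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


print({bits: lanes(bits) for bits in (8, 16, 32, 64, 128)})
# {8: 16, 16: 8, 32: 4, 64: 2, 128: 1}
```

A quadword therefore fills a register exactly, while a register holds four words or two doublewords.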
PowerXCell 8i
In 2008, IBM announced a revised variant of the Cell called the PowerXCell 8i, which is available in QS22 Blade Servers from IBM. The PowerXCell is manufactured on a 65 nm process, and adds support for up to 32 GB of slotted DDR2 memory, as well as dramatically improving double-precision floating-point performance on the SPEs from a peak of about 12.8 GFLOPS to 102.4 GFLOPS total for eight SPEs. The IBM Roadrunner supercomputer, currently the world's fastest, consists of 12,240 PowerXCell 8i processors, along with 6,562 AMD Opteron processors.
Besides the QS22 and Roadrunner computers, the PowerXCell processor is also available as an accelerator on a PCI Express card and is used as the core processor in the QPACE project.
Architecture
While the Cell chip can have a number of different configurations, the basic configuration is a multi-core chip composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE"). The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB"). Due to the nature of its applications, Cell is optimized towards single precision floating point computation. The SPEs are capable of performing double precision calculations, albeit with an order of magnitude performance penalty. New chips expected mid-2008 are rumored to boost SPE double precision performance as high as 5x over pre-2008 designs. In the meantime, the penalty can be mitigated in software using iterative refinement, which means values are calculated in double precision only when necessary.
Jack Dongarra and his team demonstrated a 3.2 GHz Cell with 8 SPEs delivering a performance equal to 100 GFLOPS on an average double precision Linpack 4096x4096 matrix.
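To illustrate the idea behind iterative refinement, here is a generic textbook sketch of mixed-precision refinement for a small linear system. It is not the code used by Dongarra's team and is not Cell-specific: single precision is emulated by rounding through the `struct` module, the helper names are hypothetical, and the matrix is assumed to be nonsingular and well conditioned. The expensive solve runs in single precision, while only the residual and the update are computed in double precision.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(a, b):
    """Gaussian elimination with partial pivoting, every result rounded to single precision."""
    n = len(b)
    # Augmented working copy, rounded to single precision.
    m = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = f32(m[r][col] / m[col][col])
            for c in range(col, n + 1):
                m[r][c] = f32(m[r][c] - f32(factor * m[col][c]))
    # Back substitution, also in emulated single precision.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n]
        for c in range(r + 1, n):
            s = f32(s - f32(m[r][c] * x[c]))
        x[r] = f32(s / m[r][r])
    return x


def residual(a, b, x):
    """Compute b - a x in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, steps=5):
    """Single-precision solve followed by double-precision correction steps."""
    x = solve_single(a, b)
    for _ in range(steps):
        r = residual(a, b, x)                    # double precision
        d = solve_single(a, r)                   # single-precision correction
        x = [xi + di for xi, di in zip(x, d)]    # double-precision update
    return x


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, 0.2, 0.3]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]

err_single = max(abs(p - q) for p, q in zip(solve_single(a, b), x_true))
err_refined = max(abs(p - q) for p, q in zip(refine(a, b), x_true))
print(err_single, err_refined)
```

For simplicity the sketch repeats the elimination at every step; a practical implementation would factor the matrix once in single precision and reuse the factors for each correction.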
Power Processor Element (PPE)
The PPE is the Power Architecture based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE works with conventional operating systems because of its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating-point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. The size of a cache line is 128 bytes. Additionally, IBM has included an AltiVec unit which is fully pipelined for single precision floating point. (AltiVec does not support double precision floating-point vectors.) Each PPU can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
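The peak figures quoted for the PPE follow from multiplying operations per clock cycle by the clock rate. A minimal sketch of that arithmetic, where the function name `peak_gflops` is purely illustrative:

```python
def peak_gflops(ops_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS: operations per cycle times clock rate in GHz."""
    return ops_per_cycle * clock_ghz

CLOCK_GHZ = 3.2

# Scalar fused multiply-add: 2 double precision operations per cycle.
ppe_double = peak_gflops(2, CLOCK_GHZ)

# Vector fused multiply-add: 8 single precision operations per cycle.
ppe_single = peak_gflops(8, CLOCK_GHZ)

print(f"{ppe_double:.1f} GFLOPS double, {ppe_single:.1f} GFLOPS single")
```

These are theoretical peaks; sustained throughput depends on how well the code keeps the pipeline full.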
Xenon in Xbox 360
The PPE was designed specifically for the Cell processor, but during development Microsoft approached IBM wanting a high performance processor core for its Xbox 360. IBM complied and made the tri-core Xenon processor, based on a slightly modified version of the PPE.
Synergistic Processing Elements (SPE)
Each SPE is composed of a "Synergistic Processing Unit" (SPU) and a "Memory Flow Controller" (MFC), which comprises the DMA, MMU, and bus interface. An SPE is a RISC processor with 128-bit SIMD organization for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instruction and data, called "Local Storage" (not to be mistaken for "Local Memory" in Sony's documents, which refers to the VRAM), which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load. The SPEs contain a 128-bit, 128-entry register file and measure 14.5 mm² on a 90 nm process. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
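The element counts listed for a single SPE operation are simply the number of elements of each width that fit in one 128-bit register. A short sketch of that relationship, with `lanes` as an illustrative helper name:

```python
REGISTER_BITS = 128      # width of one SPE register
REGISTER_ENTRIES = 128   # number of registers in the register file

def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    if REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return REGISTER_BITS // element_bits

for name, bits in [("8-bit integer", 8), ("16-bit integer", 16),
                   ("32-bit integer", 32), ("single precision float", 32)]:
    print(f"{name}: {lanes(bits)} per register")

# Total register file capacity in bytes: 128 entries of 128 bits each.
register_file_bytes = REGISTER_ENTRIES * REGISTER_BITS // 8
```

The same calculation gives a register file capacity of 2 KiB per SPE.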
In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.
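The second usage model, partitioning the input so that several SPEs apply the same operation, can be sketched in plain Python. This is only an illustration of the data layout: the chunks are processed one after another here rather than in parallel, and the names `partition`, `run_partitioned`, and `scale_kernel` are hypothetical.

```python
from typing import Callable, List, Sequence

def partition(data: Sequence[float], n_workers: int) -> List[Sequence[float]]:
    """Split data into n_workers contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)   # first `extra` chunks get one more item
        chunks.append(data[start:start + size])
        start += size
    return chunks

def run_partitioned(data: Sequence[float],
                    kernel: Callable[[Sequence[float]], List[float]],
                    n_workers: int = 6) -> List[float]:
    """Apply the same kernel to every chunk, then concatenate results in order."""
    results: List[float] = []
    for chunk in partition(data, n_workers):    # each chunk stands in for one SPE's share
        results.extend(kernel(chunk))
    return results

def scale_kernel(chunk: Sequence[float]) -> List[float]:
    """Hypothetical per-SPE kernel: double every element."""
    return [2.0 * x for x in chunk]

samples = [float(i) for i in range(20)]
doubled = run_partitioned(samples, scale_kernel)
```

On real hardware each chunk would be transferred into an SPE's local store by DMA before the kernel runs on it.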
Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only the floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors. The Cell is designed to compensate for the lack of such hardware with compiler assistance, in which prepare-to-branch instructions are generated. For double-precision floating point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude, but still reaches 12.8 GFLOPS (the PowerXCell 8i variant, which was specifically designed for double precision, reaches 102.4 GFLOPS in double-precision calculations).
Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel matrix multiplication.
Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine, designed to accelerate 3D and movie effects in consumer electronics.
Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants in the PS3 (the number of SPUs can vary in industrial applications). The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'.
The EIB is presently implemented as a circular ring comprising four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints which are discussed in the Bandwidth assessment section below.
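The 96B-per-clock figure can be reproduced directly from the quantities given in the paragraph above. A minimal sketch, with constant names chosen only for illustration:

```python
RINGS = 4                     # unidirectional channels in the EIB
TRANSFERS_PER_RING = 3        # concurrent transactions per channel
BYTES_PER_TRANSFER = 16       # channel width in bytes
SYSTEM_CLOCKS_PER_BEAT = 2    # the EIB runs at half the system clock rate

# Maximum number of transactions in flight at once.
concurrent = RINGS * TRANSFERS_PER_RING

# Peak instantaneous bandwidth in bytes per system clock.
peak_bytes_per_system_clock = concurrent * BYTES_PER_TRANSFER // SYSTEM_CLOCKS_PER_BEAT

print(concurrent, "transactions,", peak_bytes_per_system_clock, "bytes per system clock")
```

This is an upper bound on data in flight, not a sustained rate, since it ignores the arbitration limits described later.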
IBM Senior Engineer David Krolak, EIB lead designer, explains the concurrency model:
- A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity often regarded as 8B per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
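The shorter-route rule can be illustrated with a small routing function. This is a sketch under stated assumptions: participants are numbered 0 to 11 in clockwise order purely for illustration (the numbering does not reflect the physical layout of the chip), and ties at six steps are arbitrarily resolved as clockwise.

```python
from typing import Tuple

PARTICIPANTS = 12

def ring_route(src: int, dst: int, n: int = PARTICIPANTS) -> Tuple[str, int]:
    """Return (direction, steps) for the shorter way around an n-slot ring."""
    clockwise = (dst - src) % n
    counterclockwise = (src - dst) % n
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)

# The longest shortest route over all pairs of participants.
longest = max(ring_route(a, b)[1]
              for a in range(PARTICIPANTS)
              for b in range(PARTICIPANTS))
```

With twelve participants, `longest` comes out at six steps, matching the limit described above.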
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explains:
- Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.
Bandwidth assessment
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.
At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
- Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128Bx1.6 GHz = 204.8 GB/s.
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice, effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two IO controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
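The four bandwidth figures quoted in this section all derive from two formulas evaluated at two clock rates. A minimal sketch of that arithmetic, where the function name `eib_bandwidths` is illustrative:

```python
from typing import Tuple

def eib_bandwidths(system_clock_ghz: float) -> Tuple[float, float]:
    """Return (instantaneous, arbitration_limited) EIB bandwidth in GB/s."""
    # 12 concurrent 16-byte transfers, each taking two system clocks: 96 B per clock.
    instantaneous = 96 * system_clock_ghz
    # One snooped address per bus cycle, up to 128 bytes each;
    # the bus runs at half the system clock rate.
    arbitration_limited = 128 * (system_clock_ghz / 2)
    return instantaneous, arbitration_limited

for ghz in (3.2, 4.0):
    peak, limited = eib_bandwidths(ghz)
    print(f"{ghz} GHz: {peak:.1f} GB/s instantaneous, "
          f"{limited:.1f} GB/s arbitration-limited")
```

In both cases the arbitration-limited figure is two thirds of the instantaneous one, since 128 B per bus cycle equals 64 B per system clock against 96 B.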
All things considered, the theoretical 204.8 GB/s number most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this number is a fair reflection of practice as well.
Optical interconnect
Sony is currently working on the development of an optical interconnection technology for use in the device-to-device or internal interface of various types of cell-based digital consumer electronics and game systems.
Memory controller and I/O
Cell contains a dual channel Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
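The 25.6 GB/s memory bandwidth figure follows from the per-pin rate and the channel widths given above. A short sketch of the calculation, with constant names chosen for illustration:

```python
GBIT_PER_PIN = 3.2       # XIO-XDR link rate per pin, in Gbit/s
CHANNELS = 2             # dual channel interface
BITS_PER_CHANNEL = 32    # data width of each channel

# Aggregate rate across all 64 data pins, in Gbit/s.
total_gbit = GBIT_PER_PIN * CHANNELS * BITS_PER_CHANNEL

# Convert bits to bytes to get GB/s.
total_gbyte = total_gbit / 8

print(f"{total_gbyte:.1f} GB/s")
```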
The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit wide point-to-point path. Five 8-bit wide point-to-point paths are inbound lanes to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
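The split of the FlexIO total between outbound and inbound traffic is proportional to the lane counts. The sketch below back-calculates a per-lane rate from the quoted aggregate; that per-lane value is derived here for illustration and is not a figure stated in the text.

```python
LANES_OUT = 7         # outbound lanes
LANES_IN = 5          # inbound lanes
TOTAL_GBYTE = 62.4    # aggregate peak bandwidth quoted in the text, in GB/s

# Assumption: every lane carries an equal share of the aggregate.
per_lane = TOTAL_GBYTE / (LANES_OUT + LANES_IN)

outbound = LANES_OUT * per_lane
inbound = LANES_IN * per_lane

print(f"{per_lane:.1f} GB/s per lane, {outbound:.1f} out, {inbound:.1f} in")
```

Under the equal-share assumption the results agree with the 36.4 GB/s outbound and 26 GB/s inbound figures.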
Video Processing Card
Some companies, such as Leadtek, have plans to release a PCI-E card based upon the Cell to allow for "faster than real time" transcoding of H.264, MPEG-2 and MPEG-4 video.
Blade Server
On 29 August 2007, IBM announced the BladeCenter QS21. Generating a measured 1.05 giga floating point operations per second (GigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS, it is one of the most power efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera floating point operations per second (TeraFLOPS), and a standard 42U rack over 25.8 TeraFLOPS.
IBM Press Release
On 13 May 2008, IBM announced the BladeCenter QS22. The QS22 introduces the PowerXCell 8i processor with five times the double-precision floating-point performance of the QS21, and the capacity for up to 32 GB of DDR2 memory on-blade.
IBM Press Release
PCI Express Board
Several companies provide PCI-e boards utilising the IBM PowerXCell 8i. The performance is reported as 179.2 GFlops (SP), 89.6 GFlops (DP) at 2.8 GHz.
Console video games
Sony's PlayStation 3 video game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, to allow Sony to increase the yield on the processor manufacture. Only six of the seven SPEs are accessible to developers as one is reserved by the OS.
Home cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG-2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Supercomputing
IBM's latest supercomputer, IBM Roadrunner, is a hybrid of general-purpose CISC Opteron processors and Cell processors. This system assumed the #1 spot on the June 2008 Top 500 list as the first supercomputer to run at petaFLOPS speeds, having achieved a sustained 1.026 petaFLOPS using the standard Linpack benchmark. IBM Roadrunner uses the PowerXCell 8i version of the Cell processor, manufactured using 65 nm technology, with enhanced SPUs that can handle double precision calculations in the 128-bit registers, reaching 102 GFLOPS double precision per chip.
Cluster computing
Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. Innovative Computing Laboratory, a group led by Jack Dongarra, in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra's research.
As reported by Wired Magazine on October 17, 2007, an interesting application of the PlayStation 3 in a cluster configuration was implemented by astrophysicist Dr. Gaurav Khanna, from the Physics department of the University of Massachusetts Dartmouth, who replaced time used on supercomputers with a cluster of eight PlayStation 3s. The next generation of this machine, now called the PlayStation 3 Gravity Grid, uses a network of 16 machines and exploits the Cell processor for the intended application, which is binary black hole coalescence using perturbation theory. The Cell processor version used by the PlayStation 3 has a main CPU and 6 floating-point vector processors, giving the Gravity Grid machine a net of 16 general-purpose processors and 96 vector processors. The machine has a one-time cost of $9,000 to build and is adequate for black-hole simulations which would otherwise cost $6,000 per run on a conventional supercomputer. The black hole calculations are not memory-intensive and are highly localizable, and so are well-suited to this architecture.
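The processor counts and the cost comparison for the Gravity Grid reduce to simple arithmetic. A rough sketch, which ignores power, networking, and maintenance costs (none of which are given in the text):

```python
import math

CONSOLES = 16
GENERAL_PER_CONSOLE = 1   # the main CPU (PPE) in each console
VECTOR_PER_CONSOLE = 6    # vector processors (SPEs) available per console

general_processors = CONSOLES * GENERAL_PER_CONSOLE
vector_processors = CONSOLES * VECTOR_PER_CONSOLE

BUILD_COST = 9000         # one-time build cost, in US dollars
COST_PER_RUN = 6000       # cost of one run on a conventional supercomputer

# Number of whole runs after which the cluster has paid for itself.
break_even_runs = math.ceil(BUILD_COST / COST_PER_RUN)

print(general_processors, vector_processors, break_even_runs)
```

On these figures alone, the cluster would pay for itself by the second simulation run.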
The computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.
Distributed Computing
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS, which had never been reached before by a distributed computing network. Additionally, the collective efforts enabled PS3 alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world's second most powerful supercomputer at the time, IBM's BlueGene/L, performed at around 478.2 teraFLOPS. This means Folding@Home's computing power is approximately twice BlueGene/L's (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home). In late 2008, a cluster of 200 PlayStation 3 consoles was used to generate a rogue SSL certificate, effectively cracking its encryption.
Mainframes
IBM announced on April 25, 2007 that it would begin integrating its Cell Broadband Engine Architecture microprocessors into the company's line of mainframes.
Software engineering
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources, not limited to just different computing paradigms:
Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
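The job queue model can be sketched with ordinary threads standing in for SPEs. This is an analogy only, not Cell code: `spe_mini_kernel` and `ppe_run` are hypothetical names, and Python threads share memory in a way that real SPEs, with their private local stores, do not.

```python
import queue
import threading

def spe_mini_kernel(jobs: queue.Queue, results: queue.Queue) -> None:
    """Worker loop: fetch a job, execute it, report back, until told to stop."""
    while True:
        job = jobs.get()
        if job is None:               # sentinel: no more work for this worker
            break
        job_id, func, arg = job
        results.put((job_id, func(arg)))

def ppe_run(tasks, n_spes: int = 6):
    """Controller: enqueue jobs, start workers, then collect results by job id."""
    jobs, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=spe_mini_kernel, args=(jobs, results))
               for _ in range(n_spes)]
    for w in workers:
        w.start()
    for job_id, (func, arg) in enumerate(tasks):
        jobs.put((job_id, func, arg))
    for _ in workers:
        jobs.put(None)                # one sentinel per worker
    for w in workers:
        w.join()                      # wait until every worker has finished
    collected = {}
    while not results.empty():
        job_id, value = results.get()
        collected[job_id] = value
    return [collected[i] for i in range(len(tasks))]

squares = ppe_run([(lambda x: x * x, n) for n in range(10)])
```

Results are keyed by job id because workers may finish in any order.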
Self-multitasking of SPEs
The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
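In contrast to the job queue model, here no central controller hands out work: each worker takes the next ready task itself, with a mutex protecting the shared queue. A minimal thread-based sketch of that idea, in which the task body (summing a list) and all names are hypothetical stand-ins:

```python
import threading
from collections import deque

ready = deque()                  # shared queue of ready-to-run tasks
ready_lock = threading.Lock()    # mutex guarding the ready queue
totals = []                      # shared memory visible to every worker
totals_lock = threading.Lock()   # mutex guarding the shared results

def spe_worker() -> None:
    """Each worker schedules itself: take the next ready task, or exit if none remain."""
    while True:
        with ready_lock:
            if not ready:
                return
            task = ready.popleft()
        value = sum(task)        # hypothetical task body: sum a list of numbers
        with totals_lock:
            totals.append(value)

ready.extend([list(range(n)) for n in range(1, 9)])
workers = [threading.Thread(target=spe_worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The order of `totals` depends on thread timing, so only its contents, not its ordering, are deterministic.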
Stream processing
Each SPE runs a distinct program. Data comes from an input stream, and is sent to SPEs. When an SPE has terminated the processing, the output data is sent to an output stream.
This provides a flexible and powerful architecture for stream processing, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks, but are limited by the kernel loaded.
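The streaming model, in which each SPE runs its own program and passes its output to the next, can be sketched with chained generators. This is an illustration of the data flow only; the three lambda "programs" are hypothetical placeholders for real processing steps.

```python
from typing import Callable, Iterable, Iterator, List

def stage(program: Callable[[int], int], upstream: Iterable[int]) -> Iterator[int]:
    """One SPE-like stage: apply its own program to each item of the stream."""
    for item in upstream:
        yield program(item)

def build_pipeline(source: Iterable[int],
                   programs: List[Callable[[int], int]]) -> Iterator[int]:
    """Chain the stages so that each one feeds the next."""
    stream: Iterator[int] = iter(source)
    for program in programs:
        stream = stage(program, stream)
    return stream

# Hypothetical three-stage chain, each stage running a distinct program.
programs = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
output = list(build_pipeline(range(5), programs))
```

Because generators are lazy, each item flows through all stages before the next item is read from the input stream.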
Open source software development
An open source software-based strategy was adopted to accelerate the development of a Cell BE ecosystem and to provide an environment to develop Cell applications. In 2005, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers. Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTag 2005.
Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries.
Fixstars Solutions provides Yellow Dog Linux for IBM and Mercury Cell-based systems, as well as for the PlayStation 3. Terra Soft strategically partnered with Mercury to provide a Linux Board Support Package for Cell, and support and development of software applications on various other Cell platforms, including the IBM BladeCenter JS21 and Cell QS20, and Mercury Cell-based solutions. Terra Soft also maintains the Y-HPC (High Performance Computing) Cluster Construction and Management Suite and Y-Bio gene sequencing tools. Y-Bio is built upon the RPM Linux standard for package management, and offers tools which help bioinformatics researchers conduct their work with greater efficiency. IBM has developed a pseudo-filesystem for Linux called "Spufs" that simplifies access to and use of the SPE resources. IBM is currently maintaining a Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils).
In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0", consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website.
In August 2007, Mercury Computer Systems released a Software Development Kit for PLAYSTATION(R)3 for High-Performance Computing.
In November 2007, Fixstars Corporation released the new "CVCell" module aiming to accelerate several important OpenCV APIs for Cell. It ran 27 times faster on PLAYSTATION 3 Linux than on an Intel Core 2 Duo.
With the release of kernel version 2.6.16 on March 20, 2006, the Linux kernel officially supports the Cell processor.
External links