All Topics  
Parallel computing

 

   Email Print
   Bookmark   Link

 

Parallel computing


 
 



Parallel computing is a form of computationComputing

Originally, the word computing was synonymous with counting and calculating, and a science and technology that deals wit...
 in which many instructionsInstruction (computer science)

In computer science, an instruction typically refers to a single operation of a processor within a computer architecture....
 are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrentlyConcurrency (computer science)

In computer science, concurrency is a property of systems which consist of computations that execute overlapped in time, and...
 ("in parallel"). There are several different forms of parallel computing: bit-level parallelismBit-level parallelism Overview

Bit-level parallelism is a form of parallel computing based on increasing processor word size....
, instruction-level parallelismInstruction level parallelism

Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneous...
, data parallelismData parallelism

Data parallelism is a form of parallelization of computing across multiple processors in parallel computing environments....
, and task parallelismTask parallelism

Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environment...
. It has been used for many years, mainly in high-performance computing, but interest in it has grown in recent years due to the physical constraints preventing frequency scalingFrequency scaling

Frequency scaling is, in computer architecture, the technique of ramping a processor's frequency so as to achieve performan...
. Parallel computing has become the dominant paradigm in computer architectureComputer architecture

In computer engineering, computer architecture is the conceptual design and fundamental operational structure of a computer ...
, mainly in the form of multicore processorMulti-core (computing)

A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single i...
s. However, in recent years, power consumptionPower consumption

In electrical engineering, power consumption refers to the electrical energy over time that must be supplied to an electrica...
 by parallel computers has become a concern.

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism—with multi-core and multi-processorSymmetric multiprocessing

Symmetric Multiprocessing, or SMP, is a multiprocessor computer architecture where two or more identical processors ar...
 computers having multiple processing elements within a single machine, while clustersComputer cluster

A computer cluster is a group of loosely coupled computers that work together closely so that in many respects they can be v...
, MPPs, and gridsGrid computing Overview

Grid computing is an emerging computing model that provides the ability to perform higher throughput computing by taking adv...
 use multiple computers to work on the same task.

Parallel computer programsParallel algorithm

In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piec...
 are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugSoftware bug

A software bug is an error, flaw, mistake, failure, or fault in a computer program that prevents it from working as intended...
s, of which race conditionRace condition

A race condition or race hazard is a flaw in a system or process whereby the output of the process is unexpectedly and...
s are the most common. CommunicationComputer networking

Computer networking is the scientific and engineering discipline concerned with communication between computer systems....
 and synchronizationSynchronization (computer science) Summary

In computer science, especially parallel computing, synchronization means the coordination of simultaneous threads or proces...
 between the different subtasks is typically one of the greatest barriers to getting good parallel program performance. The speed-upSpeedup

In parallel computing, speedup refers to how much a parallel algorithm is faster than a corresponding sequential algorithm....
 of a program as a result of parallelization is given by Amdahl's lawAmdahl's law

Amdahl's law, named after computer architect Gene Amdahl, is used to find the maximum expected improvement to...
.

Background

Traditionally, computer software has been written for serial computation. To solve a problem, an algorithmAlgorithm

In mathematics and computing, an algorithm is a procedure for accomplishing some task which, given an initial state, will t...
 is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unitCentral processing unit

A central processing unit , or sometimes simply processor, is the component in a digital computer that interprets ins...
 on one computer. Only one instruction may execute at a time—after that instruction is finished, the next is executed.

Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.

Frequency scalingFrequency scaling

Frequency scaling is, in computer architecture, the technique of ramping a processor's frequency so as to achieve performan...
 was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Maintaining everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all computation-boundedCPU bound Summary

In computer science, CPU bound refers to a condition where the time to complete a computation is determined principally by t...
 programs.

However, power consumptionPower consumption

In electrical engineering, power consumption refers to the electrical energy over time that must be supplied to an electrica...
 by a chip is given by the equation P = C × V2 × F, where P is power, C is the capacitanceCapacitance

Capacitance is a measure of the amount of electric charge stored for a given electric potential....
 being switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltageVoltage

Voltage is the difference of electrical potential between two points of an electrical network, expressed in volts ....
, and F is the processor frequency (cycles per second). Increases in frequency increase the amount of power used in a processor. Increasing processor power consumption led ultimately to Intel's May 2004 cancellation of its Tejas and JayhawkTejas and Jayhawk

Tejas was a code name for Intel's microprocessor which was to be a successor to the latest Pentium 4 with Prescott core....
 processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.

Moore's LawMoore's Law

Moore's Law is the empirical observation that the transistor density of integrated circuits, with respect to minimum compone...
 is the empirical observation that transistor density in a microprocessor doubles every 18 to 24 months. Despite power consumption issues, and repeated predictions of its end, Moore's law is still in effect. With the end of frequency scaling, these additional transistors (which are no longer used for frequency scaling) can be used to add extra hardware for parallel computing.

Amdahl's law and Gustafson's law


Theoretically, the speed-up from parallelization should be linear—doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speed-up. Most of them have a near-linear speed-up for small numbers of processing elements, which flattens out into a constant value for large numbers of processing elements.

The potential speed-up of an algorithm on a parallel computing platform is given by Amdahl's lawAmdahl's law

Amdahl's law, named after computer architect Gene Amdahl, is used to find the maximum expected improvement to...
, originally formulated by Gene AmdahlGene Amdahl Overview

Gene Myron Amdahl is a Norwegian American computer architect and hi-tech entrepreneur, chiefly known for his work on mainfra...
 in the 1960s. It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation:

where S is the speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable. If the sequential portion of a program is 10% of the runtime, we can get no more than a 10x speed-up, regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units. "When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned."

Gustafson's lawGustafson's law

Gustafson's Law is a law in computer engineering which states that any sufficiently large problem can be efficiently paralle...
 is another law in computer engineering, closely related to Amdahl's law. It can be formulated as:

where P is the number of processors, S is the speed-up, and the non-parallelizable part of the process. Amdahl's law assumes a fixed-problem size and that the size of the sequential section is independent of the number of processors, whereas Gustafson's law does not make these assumptions.

Dependencies

Understanding data dependenciesFacts About Data dependency

A data dependency in computer science is a situation whereby computer instructions refer to the results of preceding instruc...
 is fundamental in implementing parallel algorithmParallel algorithm

In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piec...
s. No program can run more quickly than the longest chain of dependent calculations (known as the critical pathCritical Path

The term may refer to*Critical path, a project management notion...
), since calculations that depend upon prior calculations in the chain must be executed in order. However, most algorithms do not consist of just a long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel.

Let Pi and Pj be two program fragments. Bernstein's conditions describe when the two are independent and can be executed in parallel. Let Ii be all of the input variables to Pi and Oi the output variables, and likewise for Pj. P i and Pj are independent if they satisfy




Violation of the first condition introduces a flow dependency, corresponding to the first statement producing a result used by the second statement. The second condition represents an anti-dependency, when the first statement overwrites a variable needed by the second expression. The third and final condition, q, is an output dependency. When two variables write to the same location, the final output must have arisen from the second statement.

Consider the following functions, which demonstrate several kinds of dependencies:

1: function Dep(a, b)
2: c := a·b
3: d := 2·c
4: end function

Operation 3 in Dep(a, b) cannot be executed before (or even in parallel with) operation 2, because operation 3 uses a result from operation 2. It violates condition 1, and thus introduces a flow dependency.

1: function NoDep(a, b)
2: c := a·b
3: d := 2·b
4: e := a+b
5: end function

In this example, there are no dependencies between the instructions, so they can all be run in parallel.

Bernstein’s conditions do not allow memory to be shared between different processes. For that, some means of enforcing an ordering between accesses is necessary, such as semaphoresSemaphore (programming)

A semaphore is a protected variable and constitutes the classic method for restricting access to equivalent shared res...
, barriersBarrier (computer science)

In parallel computing, a barrier is a type of synchronization method....
 or some other synchronization methodSynchronization (computer science)

In computer science, especially parallel computing, synchronization means the coordination of simultaneous threads or proces...
.

Race conditions, mutual exclusion, synchronization, and parallel slowdown

Subtasks in a parallel program are often called threadsThread (computer science)

A thread in computer science is short for a thread of execution....
. Some parallel computer architectures use smaller, lightweight versions of threads known as fibersFiber (computer science)

A fiber in computer science is a term for a particularly lightweight thread of execution....
, while others use bigger versions known as processesFacts About Process (computing)

In computing, a process is a running instance of a program, including all variables and other state....
. However, "threads" is generally accepted as a generic term for subtasks. Threads will often need to update some variableFacts About Variable

In computer science and mathematics, a variable is a symbol denoting a quantity or symbolic representation....
 that is shared between them. The instructions between the two programs may be interleaved in any order. For example, consider the following program:

Thread AThread B
1A: Read variable V1B: Read variable V
2A: Add 1 to variable V2B: Add 1 to variable V
3A Write back to variable V3B: Write back to variable V


If instruction 1B is executed between 1A and 3A, or if instruction 1A is executed between 1B and 3B, the program will produce incorrect data. This is known as a race conditionRace condition

A race condition or race hazard is a flaw in a system or process whereby the output of the process is unexpectedly and...
. The programmer must use a lockLock (computer science)

In computer science, a lock is a synchronization mechanism for enforcing limits on access to a resource in an environment wh...
 to provide mutual exclusionMutual exclusion

Mutual exclusion algorithms are used in concurrent programming to avoid the simultaneous use of un-shareable resources by p...
. A lock is a programming language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it, until that variable is unlocked. The thread holding the lock is free to execute its critical sectionCritical section

In computer programming a critical section is a piece of code that accesses a shared resource that must not be concurrently ...
 (the section of a program that requires exclusive access to some variable), and to unlock the data when it is finished. Therefore, to guarantee correct program execution, the above program can be rewritten to use locks:

Thread AThread B
1A: Lock variable V1B: Lock variable V
2A: Read variable V2B: Read variable V
3A: Add 1 to variable V3B: Add 1 to variable V
4A Write back to variable V4B: Write back to variable V
5A: Unlock variable V5B: Unlock variable V


One thread will successfully lock variable V, while the other thread will be locked outSoftware lockout

In multiprocessor computer systems, software lockout is the issue of performance degradation due to the idle wait times spen...
—unable to proceed until V is unlocked again. This guarantees correct execution of the program. Locks, while necessary to ensure correct program execution, can greatly slow a program.

Locking multiple variables using non-atomicAtomic operation

An atomic operation in computer science refers to a set of operations that can be combined so that they appear to the rest o...
 locks introduces the possibility of program deadlockDeadlock

A deadlock is a situation wherein two or more competing actions are waiting for the other to finish, and thus neither ever ...
. An atomic lock locks multiple variables all at once. If it cannot lock all of them, it does not lock any of them. If two threads each need to lock the same two variables using non-atomic locks, it is possible that one thread will lock one of them and the second thread will lock the second variable. In such a case, neither thread can complete, and deadlock results.

Many parallel programs require that their subtasks act in synchronySynchronization (computer science)

In computer science, especially parallel computing, synchronization means the coordination of simultaneous threads or proces...
. This requires the use of a barrierBarrier (computer science)

In parallel computing, a barrier is a type of synchronization method....
. Barriers are typically implemented using a software lock. One class of algorithms, known as lock-free and wait-free algorithmsLock-free and wait-free algorithms

In contrast to algorithms that protect access to shared data with locks, lock-free and wait-free algorithms are specially de...
, altogether avoids the use of locks and barriers. However, this approach is generally difficult to implement and requires correctly designed data structures.

Not all parallelization results in speed-up. Generally, as a task is split up into more and more threads, those threads spend an ever-increasing portion of their time communicating with each other. Eventually, the overhead from communication dominates the time spent solving the problem, and further parallelization (that is, splitting the workload over even more threads) increases rather than decreases the amount of time required to finish. This is known as parallel slowdownFacts About Parallel slowdown

Parallel slowdown is a phenomenon in parallel computing where parallelization of a parallel computer program beyond a certai...
.

Fine-grained, coarse-grained, and embarrassing parallelism

Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second, and it is embarrassingly parallelEmbarrassingly parallel

In the jargon of parallel computing, an embarrassingly parallel workload is one for which no particular effort is needed to ...
 if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.

Consistency models



Parallel programming languages and parallel computers must have a consistency modelConsistency model

In computer science, in a distributed system such as a distributed shared memory system or a distributed data store such as a data...
 (also known as a memory model). The consistency model defines rules for how operations on computer memory occur and how results are produced.

One of the first consistency models was Leslie LamportLeslie Lamport

Dr. Leslie Lamport is an American computer scientist....
's sequential consistencySequential consistency

Sequential consistency is one of the consistency models used in the domain of the concurrent programming....
 model. Sequential consistency is the property of a parallel program that its parallel execution produces the same results as a sequential program. Specifically, a program is sequentially consistent if "... the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program".

Software transactional memoryFacts About Software transactional memory

In computer science, software transactional memory is a concurrency control mechanism analogous to database transactions for contr...
 is a common type of consistency model. Software transactional memory borrows from database theoryDatabase management system

A database management system is a system or software designed to manage a database, and run operations on the data requeste...
 the concept of atomic transactions and applies them to memory accesses.

Mathematically, these models can be represented in several ways. Petri netPetri net

A Petri net is one of several mathematical representations of discrete distributed systems....
s, which were introduced in Carl Adam Petri's 1962 doctoral thesis, were an early attempt to codify the rules of consistency models. Dataflow theory later built upon these, and Dataflow architectureDataflow architecture

Dataflow architecture is a computer architecture that directly contrasts the traditional von Neuman or control flow architec...
s were created to physically implement the ideas of dataflow theory. Beginning in the late 1970s, process calculi such as calculus of communicating systemsCalculus of Communicating Systems

The Calculus of Communicating Systems is a process calculus developed by Robin Milner....
 and communicating sequential processesCommunicating sequential processes

In computer science, Communicating Sequential Processes is a formal language for describing patterns of interaction in concu...
 were developed to permit algebraic reasoning about systems composed of interacting components. More recent additions to the process calculus family, such as the π-calculus, have added the capability for reasoning about dynamic topologies. Logics such as Lamport's TLA+Temporal Logic of Actions

Temporal Logic of Actions is a logic developed by Leslie Lamport, which combines temporal logic with a logic of actions....
, and mathematical models such as tracesTrace theory Overview

In mathematics and computer science, trace theory aims to provide a concrete mathematical underpinning for the study of conc...
 and Actor event diagramsActor model theory

In theoretical computer science, Actor model theory concerns theoretical issues for the Actor model....
, have also been developed to describe the behavior of concurrent systems.

Flynn's taxonomy

Michael J. FlynnMichael J. Flynn

Michael J. Flynn is an American professor emeritus at Stanford University....
 created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as Flynn's taxonomyFlynn's Taxonomy

Flynn's taxonomy is a classification of computer architectures, proposed by Michael J....
. Flynn classified programs and computers by whether they were operating using a single set or multiple sets of instructions, whether or not those instructions were using a single or multiple sets of data.

The single-instruction-single-data (SISD) classification is equivalent to an entirely sequential program. The single-instruction-multiple-data (SIMD) classification is analogous to doing the same operation repeatedly over a large data set. This is commonly done in signal processingSignal processing

Signal processing is the processing, amplification and interpretation of signals and deals with the analysis and manipulatio...
 applications. Multiple-instruction-single-data (MISD) is a rarely used classification. While computer architectures to deal with this were devised (such as systolic arraySystolic array

In computer architecture, a systolic array is an arrangement of central processing units in an array where data flows synchr...
s), few applications that fit this class materialized. Multiple-instruction-multiple-data (MIMD) programs are by far the most common type of parallel programs.

According to David A. Patterson and John L. HennessyJohn L. Hennessy

John LeRoy Hennessy, the founder of MIPS Computer Systems Inc., is currently serving as the 10th President of Stanford Unive...
, "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation. It is also—perhaps because of its understandability—the most widely used scheme."

Types of parallelism

Bit-level parallelism

From the advent of very-large-scale integrationVery-large-scale integration

Very-large-scale integration is the process of creating integrated circuits by combining thousands of transistor-based circ...
 (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling computer word size—the amount of information the processor can execute per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit8-bit

8-bit CPUs normally use an 8-bit data bus and a 16-bit address bus which means that their address space is limited to 64 kil...
 processor must add two 16-bit16-bit

Prominent 16-bit processors include the pdp-11, Intel 8086, Motorola 68000, Intel 80286 and the WDC 65C816....
 integerInteger

The integers consist of the positive natural numbers , their negatives and the number zero....
s, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction.

Historically, 4-bit4-bit

The Intel 4004, the world's first commercially available single-chip microprocessor, was a 4-bit CPU....
 microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which has been a standard in general-purpose computing for two decades. Not until recently (c. 2003–2004), with the advent of x86-64X86-64

x86-64 is a 64-bit microprocessor architecture and corresponding instruction set; it is a superset of the x86 architecture, ...
 architectures, have 64-bit64-bit

As of 2004, 64-bit CPUs are common in servers, and have recently been introduced to the mainstream personal computer arena i...
 processors become commonplace.

Instruction-level parallelism



A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-orderedOut-of-order execution Overview

In computer engineering, out-of-order execution, OoOE, is a paradigm used in most high-speed microprocessors in order ...
 and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.

Modern processors have multi-stage instruction pipelineInstruction pipeline Summary

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase thei...
s. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4Pentium 4

The Pentium 4 is a seventh-generation x86 architecture microprocessor produced by Intel and is their first all-new CPU desig...
 processor had a 35-stage pipeline.

In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction at a time. These are known as superscalarSuperscalar

A superscalar CPU architecture implements a form of parallelism on a single chip, thereby allowing the system as a whole to ...
 processors. Instructions can be grouped together only if there is no data dependencyData dependency

A data dependency in computer science is a situation whereby computer instructions refer to the results of preceding instruc...
 between them. ScoreboardingScoreboarding

Scoreboarding is a centralized method of dynamically scheduling a pipeline so that instructions can execute out of order whe...
 and the Tomasulo algorithmTomasulo algorithm

The Tomasulo algorithm is an algorithm developed by Robert Tomasulo from IBM to execute instructions out of order....
 (which is similar to scoreboarding but makes use of register renamingRegister renaming

In computer engineering, register renaming refers to a technique used...
) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.

Data parallelism

Data parallelism is parallelism inherent in program loopsControl flow

In computer science control flow refers to the order in which the individual statements or instructions of an imperative pr...
, which focuses on distributing the data across different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure." Many scientific and engineering applications exhibit data parallelism.

A loop-carried dependency is the dependence of a loop iteration on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the following pseudocodePseudocode

Pseudocode is a description of a computer programming algorithm that uses the structural conventions of programming languag...
 that computes the first few Fibonacci numberFibonacci number

In mathematics, the Fibonacci numbers form a sequence defined recursively by:...
s:

1: PREV2 := 0
2: PREV1 := 1
3: CUR := 1
4: do:
5: CUR := PREV1 + PREV2
6: PREV2 := PREV1
7: PREV1 := CUR
8: while (CUR < 10)

This loop cannot be parallelized because CUR depends on itself (PREV1) and PREV2, which are computed in each loop iteration. Since each iteration depends on the result of the previous one, they cannot be performed in parallel. As the size of a problem gets bigger, the amount of data-parallelism available usually does as well.

Task parallelism

Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.

Hardware

Memory and communication

Main memory in a parallel computer is either shared memoryShared memory

In HardwareIn computer hardware, shared memory refers to a large block of random access memory that can be accessed by seve...
 (shared between all processing elements in a single address spaceAddress space

In computing, an address space defines a context in which a memory address makes sense....
), or distributed memoryDistributed memory

Distributed memory is a concept used in parallel computing....
 (in which each processing element has its own local address space). Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well. Distributed shared memoryDistributed shared memory

Distributed Shared Memory, in computer science, refers to a wide class of software and hardware implementations, in which ea...
 is a combination of the two approaches, where the processing element has its own local memory and access to the memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory.

Computer architectures in which each element of main memory can be accessed with equal latencyMemory latency

In computing, memory latency is the time between initiating a request for a byte or word in memory until it is retrieved....
 and bandwidthBandwidth (computing)

In computer networking and computer science, digital bandwidth or just bandwidth often refers to a bit rate measured i...
 are known as Uniform Memory AccessUniform Memory Access

Uniform Memory Access is a computer memory architecture used in parallel computers having multiple processors and probably m...
 (UMA) systems. Typically, that can be achieved only by a shared memoryShared memory

In HardwareIn computer hardware, shared memory refers to a large block of random access memory that can be accessed by seve...
 system, in which the memory is not physically distributed. A system that does not have this property is known as a Non-Uniform Memory AccessNon-Uniform Memory Access

Non-Uniform Memory Access and Non-Uniform Memory Architecture is a computer memory design used in multiprocessors, wh...
 (NUMA) architecture. Distributed memory systems have non-uniform memory access.

Computer systems make use of cacheCache

In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, wher...
s—small, fast memories located close to the processor which store temporary copies of memory values (nearby in both the physical and logical sense). Parallel computer systems have difficulties with caches that may store the same value in more than one location, with the possibility of incorrect program execution. These computers require a cache coherencyCache coherency Summary

Cache coherence refers to the integrity of data stored in local caches of a shared resource....
 system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. Bus snoopingBus sniffing

Bus sniffing or Bus snooping is a technique used in distributed shared memory systems and multiprocessors aimed at ach...
 is one of the most common methods for keeping track of which values are being accessed (and thus should be purged). Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture. As a result, shared-memory computer architectures do not scale as well as distributed memory systems do.

Processor–processor and processor–memory communication can be implemented in hardware in several ways, including via shared (either multiported or multiplexedMultiplexing

Multiplexing is a term used in electrical engineering to refer to a process where multiple sources of information are combi...
) memory, a crossbar switchCrossbar switch

A crossbar switch is one of the principal architectures used to construct switches of many types....
, a shared bus or an interconnect network of a myriad of topologiesNetwork topology

A network topology is the pattern of links connecting pairs of nodes of a ....
 including starStar network Overview

Star networks are one of the most common computer network topologies....
, ringRing network Overview

A ring network is a topology of computer networks where each node is connected to two other nodes, so as to create a ring....
, treeTree (graph theory)

In graph theory, a tree is a graph in which any two vertices are connected by exactly one path....
, hypercubeHypercube graph

In the mathematical field of graph theory, the hypercube graph Qn is a special regular graph with 2n vertices, which corres...
, fat hypercube (a hypercube with more than one processor at a node), or n-dimensional meshMesh networking

Mesh networking is a way to route data, voice and instructions between nodes....
.

Parallel computers based on interconnect networks need to have some kind of routingRouting

In computer networking the term routing refers to selecting paths in a computer network along which to send data....
 to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines.

Classes of parallel computers

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.
Multicore computing
A multicore processor is a processor that includes multiple execution unitExecution unit

In computer engineering, an execution unit is a part of a CPU that performs the operations and calculations called for by th...
s ("cores"). These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); by contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can potentially be superscalar as well—that is, on every cycle, each core can issue multiple instructions from one instruction stream.

Simultaneous multithreadingSimultaneous multithreading

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of supersca...
 (of which Intel's HyperThreading is the best known) was an early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has only one execution unit ("core"), but when that execution unit is idling (such as during a cache miss), it uses that execution unit to process a second thread. Intel's CoreIntel Core

Intel Core is the name used for the processor codenamed Yonah, released on January 5 2006....
 and Core 2Intel Core 2

Core 2 is a ninth-generation x86 architecture microprocessor produced by Intel based on an all-new CPU architecture called t...
 processor families are Intel's first true multicore architectures. IBMIBM

company_name = International Business Machines Corporation |...
's Cell microprocessor, designed for use in the SonySony Summary

is a Japanese multinational corporation and one of the world's largest media conglomerates....
 Playstation 3PlayStation 3

The is Sony's seventh generation era video game console, third in the PlayStation series....
, is another prominent multicore processor.
Symmetric multiprocessing
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contentionBus contention

Bus contention is an undesirable state of the bus of a computer, in which more than one memory mapped device or the CPU is a...
 prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. "Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists."
Distributed computing
A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.
Cluster computing

A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancingLoad balancing (computing)

In computing, load balancing is a technique to spread work between many computers, processes, disks or other resources in o...
 is more difficult if they are not. The most common type of cluster is the Beowulf clusterBeowulf (computing)

Beowulf is a design for high-performance parallel computing clusters on inexpensive personal computer hardware....
, which is a cluster implemented on multiple identical commercial off-the-shelfFacts About Commercial off-the-shelf

Commercial off-the-shelf is a term for software or hardware products that are ready-made and available for sale to the gener...
 computers connected with a TCP/IP EthernetEthernet Summary

Ethernet is a large and diverse family of frame-based computer networking technologies for local area networks ....
 local area networkLocal area network

A local area network is a computer network covering a local area, like a home, office, or group of buildings....
. Beowulf technology was originally developed by Thomas SterlingThomas Sterling (computing)

Thomas Sterling is a colleague of Don Becker and co-author of the original Beowulf Howto....
 and Donald BeckerDonald Becker

Donald Becker is a notable developer well known for writing many of the Ethernet drivers for the Linux operating system....
. The vast majority of the TOP500TOP500 Summary

The TOP500 project ranks and details the 500 most powerful publicly-known computer systems in the world....
 supercomputers are clusters.
Massive parallel processing
A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but they are usually larger, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."

Blue Gene/L, the second fastest supercomputer in the world according to the TOP500 ranking, is an MPP.
Grid computing
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the InternetInternet

The Internet is the worldwide, publicly accessible network of interconnected computer networks that transmit data by packet ...
 to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applicationsList of distributed computing projects

A list of distributed computing projects....
 have been created, of which SETI@homeSETI@home

SETI@home is a distributed computing project using Internet-connected computers, hosted by the Space Sciences Laboratory, at...
 and Folding@HomeFolding@home

Folding@home is a distributed computing project designed to perform computationally intensive simulations of protein folding...
 are the best-known examples.

Most grid computing applications use middlewareMiddleware

Middleware is computer software that connects software components or applications....
, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network ComputingBerkeley Open Infrastructure for Network Computing

The Berkeley Open Infrastructure for Network Computing is a distributed computing infrastructure, originally developed out o...
 (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.

Software

Parallel programming languages

, libraries, APIsApplication programming interface

An application programming interface is the interface that a computer system, library or application provides in order to a...
, and parallel programming modelParallel programming model

A parallel programming model is a set of software technologies to express parallel algorithms and match applications with th...
s have been created for programming parallel computers. These can generally be divided into classes based on the assumptions they make about the underlying memory architecture—shared memory, distributed memory, or shared distributed memory. Shared memory programming languages communicate by manipulating shared memory variables. Distributed memory uses message passingMessage passing

In computer science, message passing is a form of communication used in concurrent programming, parallel programming, object...
. POSIX ThreadsPOSIX Threads

POSIX Threads is a POSIX standard for threads....
 and OpenMPOpenMP

The OpenMP application programming interface supports multi-platform shared memory multiprocessing programming in C/C++ and ...
 are two of most widely used shared memory APIs, whereas Message Passing InterfaceMessage Passing Interface

The Message Passing Interface is a computer communications protocol....
 (MPI) is the most widely used message-passing system API. One concept used in programming parallel programs is the future concept, where one part of a program promises to deliver a required datum to another part of a program at some future time.

Automatic parallelization

Automatic parallelization of a sequential program by a compilerFacts About Compiler

A compiler is a computer program that translates text written in a computer language into another computer language ....
 is the "holy grailHoly Grail

In Christian mythology, the Holy Grail was the dish, plate, or cup used by Jesus at the Last Supper, said to possess miracu...
" of parallel computing. Despite decades of work by compiler researchers, automatic parallelization has had only limited success.

Mainstream parallel programming languages remain either explicitly parallelExplicit parallelism Summary

In computer programming, explicit parallelism is the representation...
 or (at best) partially implicitImplicit parallelism

In computer science, implicit parallelism is a characteristic of a programming language that allows a compiler to automati...
, in which a programmer gives the compiler directivesDirective (programming)

A directive is an instruction to a programming language compiler about how to compile a program....
 for parallelization. A few fully implicit parallel programming languages exist—SISALSISAL

SISAL is a general-purpose single assignment functional programming language with strict semantics, automatic parallelisati...
, Parallel HaskellHaskell (programming language)

Haskell is a standardized pure functional programming language with non-strict semantics, named after the logician Haskell C...
, and (for FPGAs) Mitrion-C—but these are niche languages that are not widely used.

Application checkpointing

The larger and more complex a computer, the more that can go wrong and the shorter the mean time between failures. Application checkpointingApplication checkpointing

Checkpointing is a technique for inserting fault tolerance into computing systems....
 is a technique whereby the computer system takes a "snapshot" of the application—a record of all current resource allocations and variable states, akin to a core dumpCore dump

In computing, a core dump is a record of the raw, unstructured contents of one or more regions of working memory at a specif...
; this information can be used to restore the program if the computer should fail. Application checkpointing means that the program has to restart from only its last checkpoint rather than the beginning. For an application that may run for months, that is critical. Application checkpointing may be used to facilitate process migrationProcess migration Overview

Process migration is when processes in computer clusters are able to move from machine to machine....
.

Applications

As parallel computers become larger and faster, it becomes feasible to solve problems that previously took too long to run. Parallel computing is used in a wide range of fields, from bioinformaticsBioinformatics

Bioinformatics and computational biology involve the use of techniques including applied mathematics, informatics, sta...
 (to do protein foldingFacts About Protein folding

Protein folding is the process by which a protein assumes its characteristic functional shape or tertiary structure, also kn...
) to economics (to do simulation in mathematical financeMathematical finance

Mathematical finance is the branch of applied mathematics concerned with the financial markets....
). Common types of problems found in parallel computing applications are:

  • Dense linear algebraLinear algebra

    Linear algebra is the branch of mathematics concerned with the study of vectors, vector spaces , linear transformations, and...
  • Sparse linear algebra
  • Spectral methods (such as Cooley-Tukey Fast Fourier transformCooley-Tukey FFT algorithm

    The Cooley-Tukey algorithm, named after J.W....
    )
  • N-body problemsN-body simulation

    An N-body simulation is a simulation of massive particles under the influence of physical forces, usually gravity and so...
     (such as Barnes-Hut simulationBarnes-Hut simulation

    The Barnes-Hut simulation is an algorithm for performing an N-body simulation....
    )
  • Structured gridRegular grid

    A regular grid is a grid that has a regular topology and a regular geometry....
     problems (such as Lattice Boltzmann methodsFacts About Lattice Boltzmann methods

    Instead of solving the Navier-Stokes equations, the discrete Boltzmann equation is solved to simulate the flow of Newtonian fluid....
    ),
  • Unstructured gridUnstructured grid

    An unstructured grid is a grid that has no regular topology....
     problems (such as found in finite element analysisFinite element analysis

    Finite Element Analysis is a computer simulation technique used in engineering analysis....
    )
  • Monte Carlo simulationMonte Carlo method

    Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and ...
  • Combinational logicCombinational logic

    ----In digital circuit theory, combinational logic is a type of logic circuit whose output is a function of the present inp...
     (such as brute-force cryptographic techniquesBrute force attack

    In cryptanalysis, a brute force attack is a method of defeating a cryptographic scheme by trying a large number of possibili...
    )
  • Graph traversalGraph traversal

    Graph traversal refers to the problem of visiting all the nodes in a graph in a particular manner....
     (such as Sorting algorithmSorting algorithm

    In computer science and mathematics, a sorting algorithm is an algorithm that puts elements of a list in a certain order....
    s)
  • Dynamic programmingDynamic programming

    In computer science, dynamic programming is a method for reducing the runtime of algorithms exhibiting the properties of ove...
  • Branch and boundBranch and bound

    Branch and bound is a general algorithmic method for finding optimal solutions of various optimization problems, especially ...
     methods
  • Graphical modelGraphical model

    In probability theory and statistics, a graphical model represents dependencies among random variables by a graph in which e...
    s (such as detecting Hidden Markov modelHidden Markov model

    A hidden Markov model is a statistical model where the system being modeled is assumed to be a Markov process with unknown p...
    s and constructing Bayesian networkBayesian network

    A Bayesian network is a form of probabilistic graphical model, also known as Bayesian belief network or just belief...
    s)
  • Finite State MachineFinite state machine Overview

    A finite state machine or finite automaton is a model of behavior composed of states, transitions and actions....
     simulation

History


The origins of true (MIMD) parallelism go back to Federico Luigi, Conte MenabreaFederico Luigi, Conte Menabrea

Federico Luigi, Conte Menabrea, Marquis of Valdora was an Italian general and statesman. ...
 and his "Sketch of the Analytic Engine Invented by Charles BabbageCharles Babbage

Charles Babbage was an English mathematician, analytical philosopher, mechanical engineer and computer scientist who origi...
". IBMIBM

company_name = International Business Machines Corporation |...
 introduced the 704 in 1954, through a project in which Gene AmdahlGene Amdahl

Gene Myron Amdahl is a Norwegian American computer architect and hi-tech entrepreneur, chiefly known for his work on mainfra...
 was one of the principal architects. It became the first commercially available computer to use fully automatic floating pointFloating point

Floating-point is a means of representing real numbers in terms of digits or bits in a computer or calculator, similar to ho...
 arithmetic commands. In 1958, IBM researchers John CockeJohn Cocke

John Cocke was an American computer scientist recognised for his large contribution to computer architecture and optimizing ...
 and Daniel Slotnick discussed the use of parallelism in numerical calculations for the first time. Burroughs Corporation introduced the D825 in 1962, a four-processor computer that accessed up to 16 memory modules through a crossbar switchCrossbar switch

A crossbar switch is one of the principal architectures used to construct switches of many types....
. In 1967, Amdahl and Slotnick published a debate about the feasibility of parallel processing at American Federation of Information Processing Societies Conference. It was during this debate that Amdahl's Law was coined to define the limit of speed-up due to parallelism.

In 1969, US company HoneywellHoneywell

Honeywell is a major American multinational corporation that produces electronic control systems and automation equipment....
 introduced its first Multics system, a symmetric multiprocessor system capable of running up to eight processors in parallel. C.mmpFacts About C.mmp

The C.mmp was an early MIMD multiprocessor system developed at Carnegie Mellon University by W.A.Wulf....
, a 1970s multi-processor project at Carnegie Mellon UniversityCarnegie Mellon University

Carnegie Mellon University is a private research university located in Pittsburgh, Pennsylvania....
, was "among the first multiprocessors with more than a few processors". "The first bus-connected multi-processor with snooping caches was the Synapse N+1 in 1984."

SIMD parallel computers can be traced back to the 1970s. The motivation behind early SIMD computers was to amortize the gate delayFacts About Propagation delay

In electronics and digital circuits, the propagation delay is the amount of time starting from when the input to a logic gat...
 of the processor's control unitControl unit

A control unit is the part of a CPU or other device that directs its operation....
 over multiple instructions. In 1964, Slotnick had proposed building a massively parallel computer for the Lawrence Livermore National LaboratoryLawrence Livermore National Laboratory

The Lawrence Livermore National Laboratory is a United States Department of Energy national laboratory, managed and operat...
. His design was funded by the US Air Force, which was the earliest SIMD parallel-computing effort, ILLIAC IVILLIAC IV

*ORDVAC*ILLIAC I*ILLIAC II*ILLIAC III...
. The key to its design was a fairly high parallelism, with up to 256 processors, which allowed the machine to work on large datasets in what would later be known as vector processingVector processor

A vector processor, or array processor, is a CPU design that is able to run mathematical operations on multiple data e...
. However, ILLIAC IV was called "the most infamous of Supercomputers", because the project was only one fourth completed, but took 11 years and cost almost four times the original estimate. When it was finally ready to run its first real application in 1976, it was outperformed by existing commercial supercomputers such as the Cray-1Cray-1

The Cray-1 was a supercomputer designed by a team including Seymour Cray for Cray Research....
.

External links