Multithreading (computer hardware) - AbsoluteAstronomy.com

Multithreading computers have hardware support to efficiently execute multiple threads

Thread (computer science)

In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

. These are distinguished from multiprocessing

Multiprocessing

Multiprocessing is the use of two or more central processing units within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them...

systems (such as multi-core systems) in that the threads have to share the resources of a single core: the computing units, the CPU cache

CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...

s and the translation lookaside buffer

Translation Lookaside Buffer

A translation lookaside buffer is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.The...

(TLB). Where multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores.

Overview

The multithreading paradigm

Paradigm

The word paradigm has been used in science to describe distinct concepts. It comes from Greek "παράδειγμα" , "pattern, example, sample" from the verb "παραδείκνυμι" , "exhibit, represent, expose" and that from "παρά" , "beside, beyond" + "δείκνυμι" , "to show, to point out".The original Greek...

has become more popular as efforts to further exploit instruction level parallelism

Instruction level parallelism

Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program: 1. e = a + b 2. f = c + d 3. g = e * f...

have stalled since the late-1990s. This allowed the concept of throughput computing to re-emerge to prominence from the more specialized field of transaction processing

Transaction processing

In computer science, transaction processing is information processing that is divided into individual, indivisible operations, called transactions. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state...

Even though it is very difficult to further speed up a single thread or single program, most computer systems are actually multi-tasking among multiple threads or programs.
Techniques that would allow speed up of the overall system throughput of all tasks would be a meaningful performance gain.

The two major techniques for throughput computing are multiprocessing

Multiprocessing

and multithreading.

Advantages

Some advantages include:

If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
If a thread cannot use all the computing resources of the CPU (because instructions depend on each other's result), running another thread permits to not leave these idle.
If several threads work on the same set of data, they can actually share their cache, leading to better cache usage or synchronization on its values.

Disadvantages

Some criticisms of multithreading include:

Multiple threads can interfere with each other when sharing hardware resources such as cache
Cache
In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere...

s or translation lookaside buffer
Translation Lookaside Buffer
A translation lookaside buffer is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.The...

s (TLBs).
Execution times of a single thread are not improved but can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
Hardware support for multithreading is more visible to software, thus requiring more changes to both application programs and operating systems than Multiprocessing.

The mileage thus varies; Intel claims up to 30 percent improvement with its HyperThreading technology , while a synthetic program just performing a loop of non-optimized dependent floating-point operations actually gains a 100 percent speed improvement when run in parallel. On the other hand, hand-tuned assembly language

Assembly language

An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

programs using MMX or Altivec

AltiVec

AltiVec is a floating point and integer SIMD instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola, , and implemented on versions of the PowerPC including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's...

extensions and performing data pre-fetches (as a good video encoder might), do not suffer from cache misses or idle computing resources. Such programs therefore do not benefit from hardware multithreading and can indeed see degraded performance due to contention for shared resources.

Hardware techniques used to support multithreading

Thread (computer science)

often parallel the software techniques used
for computer multitasking

Computer multitasking

In computing, multitasking is a method where multiple tasks, also known as processes, share common processing resources such as a CPU. In the case of a computer with a single CPU, only one task is said to be running at any point in time, meaning that the CPU is actively executing instructions for...

of computer programs.

thread scheduling is also a major problem in multithreading.

Concept

The simplest type of multi-threading occurs when one thread runs until it is blocked by an
event that normally would create a long latency stall. Such a stall might be a cache-miss that has to
access off-chip memory, which might take hundreds of CPU cycles for the data to return.
Instead of waiting for the stall to resolve, a threaded processor would switch execution to another
thread that was ready to run. Only when the data for the previous thread had arrived, would the previous
thread be placed back on the list of ready-to-run threads.

For example:

Cycle i : instruction j from thread A is issued
Cycle i+1: instruction j+1 from thread A is issued
Cycle i+2: instruction j+2 from thread A is issued, load instruction which misses in all caches
Cycle i+3: thread scheduler invoked, switches to thread B
Cycle i+4: instruction k from thread B is issued
Cycle i+5: instruction k+1 from thread B is issued

Conceptually, it is similar to cooperative multi-tasking used in real-time operating systems in which
tasks voluntarily give up execution time when they need to wait upon some type of the event.

Terminology

This type of multi threading is known as Block or Cooperative or Coarse-grained multithreading.

Hardware cost

The goal of multi-threading hardware support is to allow quick switching between a blocked
thread and another thread ready to run. To achieve this goal, the hardware cost is to
replicate the program visible registers as well as some processor control registers (such as the program counter).
Switching from one thread to another thread means the hardware switches from using one register set to another.

Such additional hardware has these benefits:

The thread switch can be done in one CPU cycle.
It appears to each thread that it is executing alone and not sharing any hardware resources with any other threads. This minimizes the amount of software changes needed within the application as well as the operating system to support multithreading.

In order to switch efficiently between active threads, each active thread needs to have its own
register set. For example, to quickly switch between two threads, the register hardware needs to be instantiated twice.

Examples

Many families of microcontroller
Microcontroller
A microcontroller is a small computer on a single integrated circuit containing a processor core, memory, and programmable input/output peripherals. Program memory in the form of NOR flash or OTP ROM is also often included on chip, as well as a typically small amount of RAM...

s and embedded processors have multiple register banks to allow quick context switch
Context switch
A context switch is the computing process of storing and restoring the state of a CPU so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU. The context switch is an essential feature of a multitasking operating system...

ing for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads.

Interleaved multi-threading

Cycle i+1: an instruction from thread B is issued
Cycle i+2: an instruction from thread C is issued

The purpose of this type of multithreading is to remove all data dependency

Data dependency

A data dependency in computer science is a situation in which a program statement refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.There are three types of dependencies: data, name, and...

stalls from the execution pipeline. Since one thread is relatively independent from other threads, there's less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline.

Conceptually, it is similar to pre-emptive

Preemption (computing)

In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time. Such a change is known as a context switch...

multi-tasking used in operating systems. One can make the analogy that the time-slice given to each active thread is one CPU cycle.

Terminology

This type of multithreading was first called Barrel processing, in which the staves
of a barrel represent the pipeline stages and their executing threads. Interleaved or Pre-emptive or Fine-grained or time-sliced multithreading are more modern terminology.

Hardware costs

In addition to the hardware costs discussed in the Block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing between the different threads.

Concept

The most advanced type of multi-threading applies to superscalar

Superscalar

A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...

processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In Simultaneous Multi-threading (SMT), the superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction level parallelism

Instruction level parallelism

, this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.

For example:

Cycle i : instructions j and j+1 from thread A; instruction k from thread B all simultaneously issued
Cycle i+1: instruction j+2 from thread A; instruction k+1 from thread B; instruction m from thread C all simultaneously issued
Cycle i+2: instruction j+3 from thread A; instructions m+1 and m+2 from thread C all simultaneously issued

Terminology

To distinguish the other types of multithreading from SMT, the term Temporal multithreading

Temporal multithreading

Temporal multithreading is one of the two main forms of multithreading that can be implemented on computer processor hardware, the other being simultaneous multithreading. The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given...

is used to denote when instructions from only one thread can be issued at a time.

Hardware costs

In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the Thread ID of each instruction being processed. Again, shared resources such as caches and TLBs have to be sized for the large number of active threads being processed.

Examples

DEC
Digital Equipment Corporation
Digital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...

(later Compaq
Compaq
Compaq Computer Corporation is a personal computer company founded in 1982. Once the largest supplier of personal computing systems in the world, Compaq existed as an independent corporation until 2002, when it was acquired for US$25 billion by Hewlett-Packard....

) EV8
Alpha 21464
The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8 or Araña, the latter being its code-name...

(not completed)
Intel Hyper-Threading
Hyper-threading
Hyper-threading is Intel's term for its simultaneous multithreading implementation in its Atom, Intel Core i3/i5/i7, Itanium, Pentium 4 and Xeon CPUs....
IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...

POWER5
POWER5
The POWER5 is a microprocessor developed and fabricated by IBM. It is an improved version of the highly successful POWER4. The principal improvements are support for simultaneous multithreading and an on-die memory controller...
Sun Microsystems
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...

UltraSPARC T2
UltraSPARC T2
Sun Microsystems' UltraSPARC T2 microprocessor is a multithreading, multi-core CPU. It is a member of the SPARC family, and the successor to the UltraSPARC T1. The chip is sometimes referred to by its codename, Niagara 2...
MIPS MT
MIPS architecture
MIPS is a reduced instruction set computer instruction set architecture developed by MIPS Technologies . The early MIPS architectures were 32-bit, and later versions were 64-bit...
CRAY
Cray
Cray Inc. is an American supercomputer manufacturer based in Seattle, Washington. The company's predecessor, Cray Research, Inc. , was founded in 1972 by computer designer Seymour Cray. Seymour Cray went on to form the spin-off Cray Computer Corporation , in 1989, which went bankrupt in 1995,...

XMT
Cray XMT
The Cray XMT is the third generation of the Cray MTA supercomputer architecture originally developed by Tera. The earlier generations were called the Cray MTA and the Cray MTA-2. The XMT makes the MTA's multithreaded processors, now dubbed Threadstorm, compatible with the 1207-pin Socket F used...

Implementation specifics

A major area of research is the thread scheduler which must quickly choose among
the list of ready-to-run threads to execute next as well as maintain the ready-to-run and stalled thread lists.
An important sub-topic is the different thread priority schemes that can be used by the scheduler.
The thread scheduler might be implemented totally in software or totally in hardware or as a hw/sw combination.

Another area of research is what type of events should cause a thread switch - cache misses, inter-thread communication,
DMA

Direct memory access

Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....

completion, etc.

If the multithreading scheme replicates all software visible state, include privileged control registers, TLBs, etc., then it enables virtual machine

Virtual machine

A virtual machine is a "completely isolated guest operating system installation within a normal host operating system". Modern virtual machines are implemented with either software emulation or hardware virtualization or both together.-VM Definitions:A virtual machine is a software...

s to be created for each thread. This allows each thread to run its own operating system on the same processor. On the other hand, if only user-mode state is saved, less hardware is required which would allow for more threads to be active at one time for the same die-area/cost.

Overview

Advantages

Disadvantages

Concept

Terminology

Hardware cost

Examples

Interleaved multi-threading

Terminology

Hardware costs

Concept

Terminology

Hardware costs

Examples

Implementation specifics

See also