Memory barrier - AbsoluteAstronomy.com

A memory barrier, also known as a membar, memory fence, or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.

CPUs employ performance optimizations that can result in out-of-order execution. The reordering of memory operations (loads and stores) normally goes unnoticed within a single thread of execution, but causes unpredictable behaviour in concurrent programs and device drivers unless carefully controlled. The exact nature of an ordering constraint is hardware dependent, and defined by the architecture's memory ordering model. Some architectures provide multiple barriers for enforcing different ordering constraints.

Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices. Such code includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer hardware.

An illustrative example

When a program runs on a single CPU, the hardware performs the necessary bookkeeping to ensure that programs execute as if all memory operations were performed in the order specified by the programmer (program order), hence memory barriers are not necessary. However, when the memory is shared with multiple devices, such as other CPUs in a multiprocessor system, or memory-mapped peripherals, out-of-order access may affect program behavior. For example, a second CPU may see memory changes made by the first CPU in a sequence which differs from program order.

The following two-processor program gives a concrete example of how such out-of-order execution can affect program behavior:

Initially, memory locations x and f both hold the value 0. The program running on processor #1 loops while the value of f is zero, then it prints the value of x. The program running on processor #2 stores the value 42 into x and then stores the value 1 into f. Pseudo-code for the two program fragments is shown below. The steps of the program correspond to individual processor instructions.



Processor #1:

 while (f == 0)
  ;
 // Memory fence required here
 print x;

Processor #2:

 x = 42;
 // Memory fence required here
 f = 1;

One might expect the print statement to always print the number "42"; however, if processor #2's store operations are executed out of order, it is possible for f to be updated before x, and the print statement might therefore print "0". Similarly, processor #1's load operations may be executed out of order, so that x is read before f is checked, and again the print statement might print an unexpected value. For most programs neither of these situations is acceptable. A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f. Another can be inserted before processor #1's access to x to ensure the value of x is not read prior to seeing the change in the value of f.
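The two-processor program can be sketched in C++11, with threads standing in for processors and std::atomic_thread_fence standing in for the fences marked in the pseudo-code. This is an illustrative sketch under those assumptions, not a hardware-level rendering; the function name run_flag_example is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the two-processor program once and returns
// the value that "processor #1" would print.
int run_flag_example() {
    std::atomic<int> x{0};
    std::atomic<int> f{0};
    int printed = -1;

    std::thread processor1([&] {
        while (f.load(std::memory_order_relaxed) == 0) {
            // spin until f becomes non-zero
        }
        // Fence: the load of x below may not be moved before the load of f.
        std::atomic_thread_fence(std::memory_order_acquire);
        printed = x.load(std::memory_order_relaxed);
    });

    std::thread processor2([&] {
        x.store(42, std::memory_order_relaxed);
        // Fence: the store to x above may not be moved after the store to f.
        std::atomic_thread_fence(std::memory_order_release);
        f.store(1, std::memory_order_relaxed);
    });

    processor1.join();
    processor2.join();
    return printed;
}
```

With both fences present, the C++ memory model guarantees that run_flag_example() returns 42. Removing either fence while keeping the relaxed accesses would allow the function to return 0.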

For another illustrative example (a non-trivial one that arises in actual practice), see double-checked locking.

Low-level architecture-specific primitives

Memory barriers are low-level primitives which are part of the definition of an architecture's memory model. Like instruction sets, memory models vary considerably between architectures, so it is not appropriate to generalize about memory barrier behavior. The conventional wisdom is that using memory barriers correctly requires careful study of the architecture manuals for the hardware being programmed. That said, the following paragraph offers a glimpse of some memory barriers which exist in contemporary products.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence. Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively. Some architectures provide separate memory barriers to control ordering between different combinations of system memory and I/O memory. When more than one memory barrier instruction is available it is important to consider that the cost of different instructions may vary considerably.
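The difference between a full fence and weaker barriers can be sketched in portable C++11 rather than in any one instruction set. The example below is an illustration only: the function names are hypothetical, and how std::atomic_thread_fence is translated into machine instructions is up to the compiler and the target architecture.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: each thread stores to one variable, executes a
// full (sequentially consistent) fence, then loads the other variable.
// Returns true if both threads read 0 in this trial.
bool both_read_zero_once() {
    std::atomic<int> a{0};
    std::atomic<int> b{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        a.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
        r1 = b.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        b.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
        r2 = a.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: counts how many trials end with both reads being 0.
int count_both_zero(int trials) {
    int count = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_read_zero_once()) {
            ++count;
        }
    }
    return count;
}
```

With the two sequentially consistent fences in place, the C++ memory model rules out the outcome in which both threads read 0. With only acquire or release fences that outcome would be permitted, which is one reason the stronger fence is usually the more expensive one.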

Multithreaded programming and memory visibility

Multithreaded programs usually use synchronization primitives provided by a high-level programming environment, such as Java and .NET Framework, or an application programming interface (API) such as POSIX Threads or Windows API. Primitives such as mutexes and semaphores are provided to synchronize access to resources from parallel threads of execution. These primitives are usually implemented with the memory barriers required to provide the expected memory visibility semantics. In such environments explicit use of memory barriers is not generally necessary.
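As a small illustration of a primitive that carries its own ordering guarantees, the following C++11 sketch protects a shared counter with std::mutex and uses no explicit barrier. The function name count_with_mutex and its parameters are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: several threads increment one shared counter.
// The mutex provides both mutual exclusion and memory visibility.
long count_with_mutex(int threads, int increments) {
    long counter = 0;  // ordinary, non-atomic data
    std::mutex m;
    std::vector<std::thread> workers;

    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> lock(m);  // acquire on entry
                ++counter;
            }                                         // release on exit
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Calling count_with_mutex(4, 10000) returns 40000: every increment made under the lock is visible to the next thread that acquires the same lock.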

Each API or programming environment in principle has its own high-level memory model that defines its memory visibility semantics. Although programmers do not usually need to use memory barriers in such high level environments, it is important to understand their memory visibility semantics, to the extent possible. Such understanding is not necessarily easy to achieve because memory visibility semantics are not always consistently specified or documented.

Just as programming language semantics are defined at a different level of abstraction than machine language opcodes, a programming environment's memory model is defined at a different level of abstraction than that of a hardware memory model. It is important to understand this distinction and realize that there is not always a simple mapping between low-level hardware memory barrier semantics and the high-level memory visibility semantics of a particular programming environment. As a result, a particular platform's implementation of (say) POSIX Threads may employ stronger barriers than required by the specification. Programs which take advantage of memory visibility as-implemented rather than as-specified may not be portable.

Out-of-order execution versus compiler reordering optimizations

Memory barrier instructions only address reordering effects at the hardware level. Compilers may also reorder instructions as part of the program optimization process. Although the effects on parallel program behavior can be similar in both cases, in general it is necessary to take separate measures to inhibit compiler reordering optimizations for data that may be shared by multiple threads of execution. Note that such measures are usually only necessary for data which is not protected by synchronization primitives such as those discussed in the prior section.

In C and C++, the volatile keyword was intended to allow C and C++ programs to directly access memory-mapped I/O. Memory-mapped I/O generally requires that the reads and writes specified in source code happen in the exact order specified with no omissions. Omissions or reorderings of reads and writes by the compiler would break the communication between the program and the device accessed by memory-mapped I/O. A C or C++ compiler may not reorder reads and writes to volatile memory locations, nor may it omit a read or write to a volatile memory location, allowing a pointer to volatile memory to be used for memory-mapped I/O.
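The use of a pointer to volatile memory can be sketched as follows. No real device is involved: fake_control_register is a hypothetical stand-in for a hardware register, and the values written are arbitrary. On real hardware the address and the meaning of each value would come from the device documentation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a memory-mapped device register.
static volatile std::uint32_t fake_control_register = 0;

// Hypothetical helper: writes two commands and reads the register back.
std::uint32_t program_device() {
    volatile std::uint32_t* reg = &fake_control_register;
    *reg = 0x1;   // first write: the compiler must not omit it
    *reg = 0x2;   // second write: must stay after the first
    return *reg;  // the compiler must perform this read
}
```

Without volatile, an optimizing compiler would be free to drop the first store and to return the constant 0x2 without reading memory at all.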

Before C11 and C++11, the C and C++ standards did not address multiple threads (or multiple processors), and as such, the usefulness of volatile depends on the compiler and hardware. Although volatile guarantees that the volatile reads and volatile writes will happen in the exact order specified in the source code, the compiler may generate code (or the CPU may reorder execution) such that a volatile read or write is reordered with regard to non-volatile reads or writes, thus limiting its usefulness as an inter-thread flag or mutex. Preventing such reordering is compiler specific, but some compilers, like gcc, will not reorder operations around in-line assembly code with volatile and "memory" tags, as in: asm volatile ("" : : : "memory"); (see more examples in compiler memory barrier). Moreover, it is not guaranteed that volatile reads and writes will be seen in the same order by other processors, due to caching, the cache coherence protocol and relaxed memory ordering, meaning volatile variables alone may not even work as inter-thread flags or mutexes.
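The distinction between restraining the compiler and restraining the hardware can be sketched with the two fence functions in the C++11 standard library. This uses std::atomic_signal_fence as a portable stand-in for the gcc-specific asm statement, which is a substitution made for this example; the variable and function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

int payload = 0;            // ordinary, non-volatile data
std::atomic<int> ready{0};  // flag observed by another context

// Compiler-only barrier: limits reordering by the compiler but emits
// no CPU fence instruction. Enough for a signal handler on the same
// thread, not for another processor.
void publish_compiler_only(int value) {
    payload = value;
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

// Compiler and hardware barrier: also orders the stores as seen by
// other threads that pair it with an acquire fence or acquire load.
void publish_with_hardware_fence(int value) {
    payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(2, std::memory_order_relaxed);
}
```

Both functions have the same effect when observed from a single thread. They differ only in what a second processor is allowed to see.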

Some languages and compilers may provide sufficient facilities to implement functions which address both the compiler reordering and machine reordering issues. In Java version 1.5 (also known as version 5), the volatile keyword is guaranteed to prevent certain hardware and compiler reorderings, as part of the new Java Memory Model. The C++ memory model does not use volatile for this purpose; instead, C++11 (formerly known as C++0x) includes special atomic types and operations with semantics similar to those of volatile in the Java Memory Model.
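A minimal sketch of the C++11 atomic types follows, assuming a single producer and a single consumer; the function name pass_message is hypothetical. The flag is a std::atomic<bool> accessed with the default (sequentially consistent) ordering, so no explicit fence and no volatile qualifier is needed to publish the ordinary string.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical helper: one thread publishes a string, another waits
// for the atomic flag and then reads the string.
std::string pass_message() {
    std::string message;             // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // atomic flag, default ordering
    std::string received;

    std::thread consumer([&] {
        while (!ready.load()) {
            // spin until the producer sets the flag
        }
        received = message;          // safe: ordered after the flag load
    });
    std::thread producer([&] {
        message = "hello";
        ready.store(true);           // publishes the write to message
    });

    consumer.join();
    producer.join();
    return received;
}
```

The atomic store and load handle both compiler and hardware reordering, so pass_message() returns "hello" on every run.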

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.