Garbage collection (computer science)

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, garbage collection (GC) is a form of automatic memory management

Memory management

Memory management is the act of managing computer memory. The essential requirement of memory management is to provide ways to dynamically allocate portions of memory to programs at their request, and freeing it for reuse when no longer needed. This is critical to the computer system.Several...

. The garbage collector, or just collector, attempts to reclaim garbage
Garbage (computer science)
Garbage, in the context of computer science, refers to objects, data, or other regions of the memory of a computer system , which will not be used in any future computation by the system, or by a program running on it...

, or memory occupied by objects

Object (computer science)

In computer science, an object is any entity that can be manipulated by the commands of a programming language, such as a value, variable, function, or data structure...

that are no longer in use by the program

Application software

Application software, also known as an application or an "app", is computer software designed to help the user to perform specific tasks. Examples include enterprise software, accounting software, office suites, graphics software and media players. Many application programs deal principally with...

. Garbage collection was invented by John McCarthy

John McCarthy (computer scientist)

John McCarthy was an American computer scientist and cognitive scientist. He coined the term "artificial intelligence" , invented the Lisp programming language and was highly influential in the early development of AI.McCarthy also influenced other areas of computing such as time sharing systems...

around 1959 to solve problems in Lisp.

Garbage collection is often portrayed as the opposite of manual memory management

Manual memory management

In computer science, manual memory management refers to the usage of manual instructions by the programmer to identify and deallocate unused objects, or garbage. Up until the mid 1990s, the majority of programming languages used in industry supported manual memory management...

, which requires the programmer to specify which objects to deallocate and return to the memory system. However, many systems use a combination of approaches, including other techniques such as stack allocation and region inference.

Garbage collection does not traditionally manage limited resources other than memory that typical programs use, such as network sockets, database handles, user interaction windows, and file and device descriptors. Methods used to manage such resources, particularly destructors

Destructor (computer science)

In object-oriented programming, a destructor is a method which is automatically invoked when the object is destroyed...

, may suffice as well to manage memory, leaving no need for GC. Some GC systems allow such other resources to be associated with a region of memory that, when collected, causes the other resource to be reclaimed; this is called finalization

Finalizer

In object-oriented programming languages that use garbage collection, a finalizer is a special method that is executed when an object is garbage collected. It is similar in function to a destructor...

. Finalization may introduce complications limiting its usability, such as intolerable latency between disuse and reclaim of especially limited resources, or a lack of control over which thread performs the work of reclaiming.

Principles

The basic principles of garbage collection are:

Find data objects in a program that cannot be accessed in the future
Reclaim the resources used by those objects

Many computer language

Programming language

A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

s require garbage collection, either as part of the language specification (e.g., Java

Java (programming language)

Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

, C#, and most scripting language

Scripting language

A scripting language, script language, or extension language is a programming language that allows control of one or more applications. "Scripts" are distinct from the core code of the application, as they are usually written in a different language and are often created or at least modified by the...

s) or effectively for practical implementation (e.g., formal languages like lambda calculus

Lambda calculus

In mathematical logic and computer science, lambda calculus, also written as λ-calculus, is a formal system for function definition, function application and recursion. The portion of lambda calculus relevant to computation is now called the untyped lambda calculus...

); these are said to be garbage collected languages. Other languages were designed for use with manual memory management, but have garbage collected implementations available (e.g., C

C (programming language)

C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, C++

C++

C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

). Some languages, like Ada

Ada (programming language)

Ada is a structured, statically typed, imperative, wide-spectrum, and object-oriented high-level computer programming language, extended from Pascal and other languages...

, Modula-3

Modula-3

In computer science, Modula-3 is a programming language conceived as a successor to an upgraded version of Modula-2 known as Modula-2+. While it has been influential in research circles it has not been adopted widely in industry...

, and C++/CLI

C++/CLI

C++/CLI is Microsoft's language specification intended to supersede Managed Extensions for C++. It is a complete revision that aims to simplify the older Managed C++ syntax . C++/CLI is standardized by Ecma as ECMA-372...

allow both garbage collection and manual memory management

Manual memory management

to co-exist in the same application by using separate heaps

Heap (data structure)

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: if B is a child node of A, then key ≥ key. This implies that an element with the greatest key is always in the root node, and so such a heap is sometimes called a max-heap...

for collected and manually managed objects; others, like D

D (programming language)

The D programming language is an object-oriented, imperative, multi-paradigm, system programming language created by Walter Bright of Digital Mars. It originated as a re-engineering of C++, but even though it is mainly influenced by that language, it is not a variant of C++...

, are garbage collected but allow the user to manually delete objects and also entirely disable garbage collection when speed is required. While integrating garbage collection into the language's compiler

Compiler

A compiler is a computer program that transforms source code written in a programming language into another computer language...

and runtime system enables a much wider choice of methods, post hoc GC systems exist, including some that do not require recompilation. (Post-hoc GC is sometimes distinguished as litter collection.) The garbage collector will almost always be closely integrated with the memory allocator.

Benefits

Garbage collection frees the programmer from manually dealing with memory deallocation. As a result, certain categories of bugs are eliminated or substantially reduced:

Dangling pointer bugs, which occur when a piece of memory is freed while there are still pointers to it, and one of those pointers is dereferenced. By then the memory may have been re-assigned to another use, with unpredictable results.
Double free bugs, which occur when the program tries to free a region of memory that has already been freed, and perhaps already been allocated again.
Certain kinds of memory leak
Memory leak
A memory leak, in computer science , occurs when a computer program consumes memory but is unable to release it back to the operating system. In object-oriented programming, a memory leak happens when an object is stored in memory but cannot be accessed by the running code...

s, in which a program fails to free memory occupied by objects that have actually become unreachable, which leads to memory exhaustion if such a behavior is repeated indefinitely. (Garbage collection typically does not deal with the unbounded accumulation of data which is reachable, but which will actually not be used by the program.)

Some of the bugs addressed by garbage collection can have security implications.

Disadvantages

Typically, garbage collection has certain disadvantages:

Garbage collection consumes computing resources in deciding which memory to free, reconstructing facts that may have been known to the programmer. The penalty for the convenience of not annotating object lifetime manually in the source code is overhead
Computational overhead
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal...

, often leading to decreased or uneven performance. Interaction with memory hierarchy effects can make this overhead intolerable in circumstances that are hard to predict or to detect in routine testing.
The moment when the garbage is actually collected can be unpredictable, resulting in stalls scattered throughout a session. Unpredictable stalls can be unacceptable in real-time environments
Real-time computing
In computer science, real-time computing , or reactive computing, is the study of hardware and software systems that are subject to a "real-time constraint"— e.g. operational deadlines from event to system response. Real-time programs must guarantee response within strict time constraints...

, in transaction processing
Transaction processing
In computer science, transaction processing is information processing that is divided into individual, indivisible operations, called transactions. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state...

, or in interactive programs.
You can't have pointers, only references.
Non-deterministic GC is incompatible with RAII based management of non GCed resources. As a result, the need for explicit manual resource management (release/close) for non-GCed resources becomes transitive to composition. That is: in a non-deterministic GC system, if a resource or a resource like object requires manual resource management (release/close), and this object is used as 'part of' an other object, than the composed object will also become a resource like object that itself requires manual resource management (release/close).

Tracing garbage collectors

Tracing garbage collectors are the most common type of garbage collector. They first determine which objects are reachable (or potentially reachable), and then discard all remaining objects.

Reachability of an object

Informally, an object is reachable if it is referenced by at least one variable in the program, either directly or through references from other reachable objects. More precisely, objects can be reachable in only two ways:

A distinguished set of objects are assumed to be reachable: these are known as the roots. Typically, these include all the objects referenced from anywhere in the call stack
Call stack
In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This kind of stack is also known as an execution stack, control stack, run-time stack, or machine stack, and is often shortened to just "the stack"...

(that is, all local variables and parameters in the functions currently being invoked), and any global variables.
Anything referenced from a reachable object is itself reachable; more formally, reachability is a transitive closure
Transitive closure
In mathematics, the transitive closure of a binary relation R on a set X is the transitive relation R+ on set X such that R+ contains R and R+ is minimal . If the binary relation itself is transitive, then the transitive closure will be that same binary relation; otherwise, the transitive closure...

.

The reachability definition of "garbage" is not optimal, insofar as the last time a program uses an object could be long before that object falls out of the environment scope. A distinction is sometimes drawn between syntactic garbage, those objects the program cannot possibly reach, and semantic garbage, those objects the program will in fact never again use. For example:

Object x = new Foo;
Object y = new Bar;
x = new Quux;
/* at this point, we know that the Foo object
* originally assigned to x will never be
* accessed: it is syntactic garbage
*/

if(x.check_something) {
x.do_something(y);
}
System.exit(0);
/* in the above block, y *could* be semantic garbage,
* but we won't know until x.check_something returns
* some value: if it returns at all
*/

The problem of precisely identifying semantic garbage can easily be shown to be partially decidable

Decision problem

In computability theory and computational complexity theory, a decision problem is a question in some formal system with a yes-or-no answer, depending on the values of some input parameters. For example, the problem "given two numbers x and y, does x evenly divide y?" is a decision problem...

: a program that allocates an object X, runs an arbitrary input program P, and uses X if and only if P finishes would require a semantic garbage collector to solve the halting problem

Halting problem

In computability theory, the halting problem can be stated as follows: Given a description of a computer program, decide whether the program finishes running or continues to run forever...

. Although conservative heuristic methods for semantic garbage detection remain an active research area, essentially all practical garbage collectors focus on syntactic garbage.

Another complication with this approach is that, in languages with both reference type

Reference type

In programming language theory, a reference type is a data type that can only be accessed by references. Unlike objects of value types, objects of reference types cannot be directly embedded into composite objects and are always dynamically allocated...

s and unboxed value type

Value type

In computer science, the term value type is commonly used to refer to one of two kinds of data types: Types of values or Types of objects with deep copy semantics.- Types of Values :...

s, the garbage collector needs to somehow be able to distinguish which variables on the stack or fields in an object are regular values and which are references: in memory, an integer and a reference might look alike. The garbage collector then needs to know whether to treat the element as a reference and follow it, or whether it is a primitive value. One common solution is the use of tagged pointer

Tagged pointer

In computer science, a tagged pointer is a common example of a tagged union, where the primary type of data to be stored in the union is a pointer...

Strong and Weak references

The garbage collector can reclaim only objects that have no references. However, there can exist additional references that, in a sense, do not matter, which are called weak references

Weak reference

In computer programming, a weak reference is a reference that does not protect the referenced object from collection by a garbage collector . An object referenced only by weak references is considered unreachable and so may be collected at any time...

. In discussions about weak references, ordinary references are sometimes called strong references. An object is eligible for garbage collection if there are no strong (i.e. ordinary) references to it, even though there still might be some weak references to it.

A weak reference is not merely just any pointer to the object that a garbage collector does not care about. The term is usually reserved for a properly managed category of special reference objects which are safe to use even when the object disappears because they lapse to a safe value. An unsafe reference that is not known to the garbage collector will simply remain dangling by continuing to refer to the address where the object previously resided. This is not a weak reference.

In some implementations, notably in Microsoft .NET, the weak references are divided into two further subcategories: long weak references (tracks resurrection) and short weak references.

Weak Collections

Objects which maintain collections of other objects can also be devised which have weak tracking features. For instance, weak hash tables are useful. Like a regular hash table, a weak hash table maintains an association between pairs of objects, where each pair is understood to be a key and value. However, the hash table does not actually maintain a strong reference on these objects. A special behavior takes place when either the key or value or both become garbage: the hash table entry is spontaneously deleted. There exist further refinements such as hash tables which have only weak keys (value references are ordinary, strong references) or only weak values (key references are weak).

Weak hash tables are important for maintaining associations between objects, such that the objects engaged in the association can still become garbage if nothing in the program refers to them any longer (other than the associating hash table).

The use of a regular hash table for such a purpose could lead to a "logical memory leak": the accumulation of reachable data which the program does not need and will not use.

Basic algorithm

Tracing collectors are so called because they trace through the working set of memory. These garbage collectors perform collection in cycles. A cycle is started when the collector decides (or is notified) that it needs to reclaim memory, which happens most often when the system is low on memory. The original method involves a naïve mark-and-sweep in which the entire memory set is touched several times.

Naïve mark-and-sweep

In the naïve mark-and-sweep method, each object in memory has a flag (typically a single bit) reserved for garbage collection use only. This flag is always cleared, except during the collection cycle. The first stage of collection does a tree traversal of the entire 'root set', marking each object that is pointed to as being 'in-use'. All objects that those objects point to, and so on, are marked as well, so that every object that is ultimately pointed to from the root set is marked. Finally, all memory is scanned from start to finish, examining all free or used blocks; those with the in-use flag still cleared are not reachable by any program or data, and their memory is freed. (For objects which are marked in-use, the in-use flag is cleared again, preparing for the next cycle.)

This method has several disadvantages, the most notable being that the entire system must be suspended during collection; no mutation of the working set can be allowed. This will cause programs to 'freeze' periodically (and generally unpredictably), making real-time and time-critical applications impossible. In addition, the entire working memory must be examined, much of it twice, potentially causing problems in paged memory systems.

Tri-color marking

Because of these pitfalls, most modern tracing garbage collectors implement some variant of the tri-colour marking abstraction

Abstraction (computer science)

In computer science, abstraction is the process by which data and programs are defined with a representation similar to its pictorial meaning as rooted in the more complex realm of human life and language with their higher need of summarization and categorization , while hiding away the...

, but simple collectors (such as the mark-and-sweep collector) often do not make this abstraction explicit.
Tri-colour marking works as follows:

Create initial white, grey, and black sets; these sets will be used to maintain progress during the cycle.
- Initially the white set or condemned set is the set of objects that are candidates for having their memory recycled.
- The black set is the set of objects that cheaply can be proven to have no references to objects in the white set; in many implementations the black set starts off empty.
- The grey set is all the objects that are reachable from root references but the objects referenced by grey objects haven't been scanned yet. Grey objects are known to be reachable from the root, so cannot be garbage collected: grey objects will eventually end up in the black set. The grey state means we still need to check any objects that the object references.
- The grey set is initialised to objects which are referenced directly at root level; typically all other objects are initially placed in the white set.
- Objects can move from white to grey to black, never in the other direction.
Pick an object from the grey set. Blacken this object (move it to the black set), by greying all the white objects it references directly. This confirms that this object cannot be garbage collected, and also that any objects it references cannot be garbage collected.
Repeat the previous step until the grey set is empty.
When there are no more objects in the grey set, then all the objects remaining in the white set have been demonstrated not to be reachable, and the storage occupied by them can be reclaimed.

The 3 sets partition

Partition of a set

In mathematics, a partition of a set X is a division of X into non-overlapping and non-empty "parts" or "blocks" or "cells" that cover all of X...

memory; every object in the system, including the root set, is in precisely one set.

The tri-colour marking algorithm preserves an important invariant:

No black object points directly to a white object.

This ensures that the white objects can be safely destroyed once the grey set is empty. (Some variations on the algorithm do not preserve the tricolour invariant but they use a modified form for which all the important properties hold.)

The tri-colour method has an important advantage: it can be performed 'on-the-fly', without halting the system for significant time periods. This is accomplished by marking objects as they are allocated and during mutation, maintaining the various sets. By monitoring the size of the sets, the system can perform garbage collection periodically, rather than as-needed. Also, the need to touch the entire working set each cycle is avoided.

Implementation strategies

In order to implement the basic tri-colour algorithm, several important design decisions must be made, which can significantly affect the performance characteristics of the garbage collector.

Moving vs. non-moving

Once the unreachable set has been determined, the garbage collector may simply release the unreachable objects and leave everything else as it is, or it may copy some or all of the reachable objects into a new area of memory, updating all references to those objects as needed. These are called "non-moving" and "moving" (or, alternatively, "non-compacting" and "compacting") garbage collectors, respectively.

At first, a moving GC strategy may seem inefficient and costly compared to the non-moving approach, since much more work would appear to be required on each cycle. In fact, however, the moving GC strategy leads to several performance advantages, both during the garbage collection cycle itself and during actual program execution:

No additional work is required to reclaim the space freed by dead objects; the entire region of memory from which reachable objects were moved can be considered free space. In contrast, a non-moving GC must visit each unreachable object and somehow record that the memory it alone occupied is available.
Similarly, new objects can be allocated very quickly. Since large contiguous regions of memory are usually made available by the moving GC strategy, new objects can be allocated by simply incrementing a 'free memory' pointer. A non-moving strategy may, after some time, lead to a heavily fragmented
Fragmentation (computer)
In computer storage, fragmentation is a phenomenon in which storage space is used inefficiently, reducing storage capacity and in most cases reducing the performance. The term is also used to denote the wasted space itself....

heap, requiring expensive consultation of "free lists" of small available blocks of memory in order to allocate new objects.
If an appropriate traversal order is used (such as cdr-first for list cons
Cons
In computer programming, cons is a fundamental function in most dialects of the Lisp programming language. cons constructs memory objects which hold two values or pointers to values. These objects are referred to as cells, conses, non-atomic s-expressions , or pairs...

es), objects that refer to each other frequently can be moved very close to each other in memory, increasing the likelihood that they will be located in the same cache line or virtual memory
Virtual memory
In computing, virtual memory is a memory management technique developed for multitasking kernels. This technique virtualizes a computer architecture's various forms of computer data storage , allowing a program to be designed as though there is only one kind of memory, "virtual" memory, which...

page. This can significantly speed up access to these objects through these references.

One disadvantage of a moving garbage collector is that it only allows access through references that are managed by the garbage collected environment, and does not allow pointer arithmetic. This is because any native pointers to objects will be invalidated when the garbage collector moves the object (they become dangling pointer

Dangling pointer

Dangling pointers and wild pointers in computer programming are pointers that do not point to a valid object of the appropriate type. These are special cases of memory safety violations....

s). For interoperability

Interoperability

Interoperability is a property referring to the ability of diverse systems and organizations to work together . The term is often used in a technical systems engineering sense, or alternatively in a broad sense, taking into account social, political, and organizational factors that impact system to...

with native code, the garbage collector must copy the object contents to a location outside of the garbage collected region of memory. An alternative approach is to pin the object in memory, preventing the garbage collector from moving it and allowing the memory to be directly shared with native pointers (and possibly allowing pointer arithmetic).

Copying vs. mark-and-sweep vs. mark-and-don't-sweep

To further refine the distinction, tracing collectors can also be divided by considering how the three sets of objects (white, grey, and black) are maintained during a collection cycle.

The most straightforward approach is the semi-space collector, which dates to 1969. In this moving GC scheme, memory is partitioned into a "from space" and "to space". Initially, objects are allocated into "to space" until they become full and a collection is triggered. At the start of a collection, the "to space" becomes the "from space", and vice versa. The objects reachable from the root set are copied from the "from space" to the "to space". These objects are scanned in turn, and all objects that they point to are copied into "to space", until all reachable objects have been copied into "to space". Once the program continues execution, new objects are once again allocated in the "to space" until it is once again full and the process is repeated. This approach has the advantage of conceptual simplicity (the three object color sets are implicitly constructed during the copying process), but the disadvantage that a (possibly) very large contiguous region of free memory is necessarily required on every collection cycle. This technique is also known as stop-and-copy. Cheney's algorithm

Cheney's algorithm

Cheney's algorithm, first described in a 1970 ACM paper by C.J. Cheney, is a method of garbage collection in computer software systems. In this scheme, the heap is divided into two equal halves, only one of which is in use at any one time...

is an improvement on the semi-space collector.

A mark and sweep garbage collector maintains a bit (or two) with each object to record whether it is white or black; the grey set is either maintained as a separate list (such as the process stack) or using another bit. As the reference tree is traversed during a collection cycle (the "mark" phase), these bits are manipulated by the collector to reflect the current state. A final "sweep" of the memory areas then frees white objects. The mark and sweep strategy has the advantage that, once the unreachable set is determined, either a moving or non-moving collection strategy can be pursued; this choice of strategy can even be made at runtime, as available memory permits. It has the disadvantage of "bloating" objects by a small amount.

A mark and don't sweep garbage collector, like the mark-and-sweep, maintains a bit with each object to record whether it is white or black; the gray set is either maintained as a separate list (such as the process stack) or using another bit. There are two key differences here. First, black and white mean different things than they do in the mark and sweep collector. In a "mark and don't sweep" system, all reachable objects are always black. An object is marked black at the time it is allocated, and it will stay black even if it becomes unreachable. A white object is unused memory and may be allocated. Second, the interpretation of the black/white bit can change. Initially, the black/white bit may have the sense of (0=white, 1=black). If an allocation operation ever fails to find any available (white) memory, that means all objects are marked used (black). The sense of the black/white bit is then inverted (for example, 0=black, 1=white). Everything becomes white. This momentarily breaks the invariant that reachable objects are black, but a full marking phase follows immediately, to mark them black again. Once this is done, all unreachable memory is white. No "sweep" phase is necessary.

Generational GC (ephemeral GC)

It has been empirically observed that in many programs, the most recently created objects are also those most likely to become unreachable quickly (known as infant mortality or the generational hypothesis). A generational GC (also known as ephemeral GC) divides objects into generations and, on most cycles, will place only the objects of a subset of generations into the initial white (condemned) set. Furthermore, the runtime system maintains knowledge of when references cross generations by observing the creation and overwriting of references. When the garbage collector runs, it may be able to use this knowledge to prove that some objects in the initial white set are unreachable without having to traverse the entire reference tree. If the generational hypothesis holds, this results in much faster collection cycles while still reclaiming most unreachable objects.

In order to implement this concept, many generational garbage collectors use separate memory regions for different ages of objects. When a region becomes full, those few objects that are referenced from older memory regions are promoted (copied) up to the next highest region, and the entire region can then be overwritten with fresh objects. This technique permits very fast incremental garbage collection, since the garbage collection of only one region at a time is all that is typically required.

Generational garbage collection is a heuristic approach, and some unreachable objects may not be reclaimed on each cycle. It may therefore occasionally be necessary to perform a full mark and sweep or copying garbage collection to reclaim all available space. In fact, runtime systems for modern programming languages (such as Java

Java (programming language)

and the .NET Framework

.NET Framework

The .NET Framework is a software framework that runs primarily on Microsoft Windows. It includes a large library and supports several programming languages which allows language interoperability...

) usually use some hybrid of the various strategies that have been described thus far; for example, most collection cycles might look only at a few generations, while occasionally a mark-and-sweep is performed, and even more rarely a full copying is performed to combat fragmentation. The terms "minor cycle" and "major cycle" are sometimes used to describe these different levels of collector aggression.

Stop-the-world vs. incremental vs. concurrent

Simple stop-the-world garbage collectors completely halt execution of the program to run a collection cycle, thus guaranteeing that new objects are not allocated and objects do not suddenly become unreachable while the collector is running.

This has the obvious disadvantage that the program can perform no useful work while a collection cycle is running (sometimes called the "embarrassing pause"). Stop-the-world garbage collection is therefore mainly suitable for non-interactive programs. Its advantage is that it is both simpler to implement and faster than incremental garbage collection.

Incremental and concurrent garbage collectors are designed to reduce this disruption by interleaving their work with activity from the main program. Incremental garbage collectors perform the garbage collection cycle in discrete phases, with program execution permitted between each phase (and sometimes during some phases). Concurrent garbage collectors do not stop program execution at all, except perhaps briefly when the program's execution stack is scanned. However, the sum of the incremental phases takes longer to complete than one batch garbage collection pass, so these garbage collectors may yield lower total throughput.

Careful design is necessary with these techniques to ensure that the main program does not interfere with the garbage collector and vice versa; for example, when the program needs to allocate a new object, the runtime system may either need to suspend it until the collection cycle is complete, or somehow notify the garbage collector that there exists a new, reachable object.

Precise vs. conservative and internal pointers

Some collectors can correctly identify all pointers (references) in an object; these are called precise (also exact or accurate) collectors, the opposite being a conservative or partly conservative collector. Conservative collectors assume that any bit pattern in memory could be a pointer if, interpreted as a pointer, it would point into an allocated object. Conservative collectors may produce false positives, where unused memory is not released because of improper pointer identification. This is not always a problem in practice unless the program handles a lot of data that could easily be misidentified as a pointer. False positives are generally less problematic on 64-bit systems than on 32-bit systems because the range of valid memory addresses tends to be a tiny fraction of the range of 64-bit values. Thus, an arbitrary 64-bit pattern is unlikely to mimic a valid pointer. Whether a precise collector is practical usually depends on the type safety properties of the programming language in question. An example for which a conservative garbage collector would be needed is the C language

C (programming language)

C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, which allows typed (non-void) pointers to be type cast into untyped (void) pointers, and vice versa.

A related issue concerns internal pointers, or pointers to fields within an object. If the semantics of a language allow internal pointers, then there may be many different addresses that can refer to parts of the same object, which complicates determining whether an object is garbage or not. An example for this is the C++

C++

language, in which multiple inheritance can cause pointers to base objects to have different addresses. Even in languages like Java

Java (programming language)

, internal pointers can exist during the computation of, say, an array element address. In a tightly-optimized program, the corresponding pointer to the object itself may have been overwritten in its register, so such internal pointers need to be scanned.

Performance implications

Tracing garbage collectors require some implicit runtime overhead

Computational overhead

In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal...

that may be beyond the control of the programmer, and can sometimes lead to performance problems. For example, commonly used stop-the-world garbage collectors, which pause program execution at arbitrary times, may make garbage collection inappropriate for some embedded system

Embedded system

An embedded system is a computer system designed for specific control functions within a larger system. often with real-time computing constraints. It is embedded as part of a complete device often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal...

s, high-performance server

Server (computing)

In the context of client-server architecture, a server is a computer program running to serve the requests of other programs, the "clients". Thus, the "server" performs some computational task on behalf of "clients"...

software, and applications with real-time

Real-time computing

In computer science, real-time computing , or reactive computing, is the study of hardware and software systems that are subject to a "real-time constraint"— e.g. operational deadlines from event to system response. Real-time programs must guarantee response within strict time constraints...

needs.

Manual heap allocation:

search for best/first-fit block of sufficient size
free list maintenance

Garbage collection:

locate reachable objects
copy reachable objects for moving collectors
read/write barriers for incremental collectors
search for best/first-fit block and free list maintenance for non-moving collectors

It is difficult to compare the two cases directly, as their behavior depends on the situation. For example, in the best case for a garbage collecting system, allocation just increments a pointer, but in the best case for manual heap allocation, the allocator maintains freelists of specific sizes and allocation only requires following a pointer. However, this size segregation usually cause a large degree of external fragmentation, which can have an adverse impact on cache behaviour. Memory allocation in a garbage collected language may be implemented using heap allocation behind the scenes (rather than simply incrementing a pointer), so the performance advantages listed above don't necessarily apply in this case. In some situations, most notably embedded system

Embedded system

s, it is possible to avoid both garbage collection and heap management overhead by preallocating pools of memory and using a custom, lightweight scheme for allocation/deallocation.

The overhead of write barriers is more likely to be noticeable in an imperative

Imperative programming

In computer science, imperative programming is a programming paradigm that describes computation in terms of statements that change a program state...

-style program which frequently writes pointers into existing data structures than in a functional

Functional programming

In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...

-style program which constructs data only once and never changes them.

Some advances in garbage collection can be understood as reactions to performance issues. Early collectors were stop-the-world collectors, but the performance of this approach was distracting in interactive applications. Incremental collection avoided this disruption, but at the cost of decreased efficiency due to the need for barriers. Generational collection techniques are used with both stop-the-world and incremental collectors to increase performance; the trade-off is that some garbage is not detected as such for longer than normal.

Determinism

Tracing garbage collection is not deterministic
Deterministic algorithm
In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states...

in the timing of object finalization. An object which becomes eligible for garbage collection will usually be cleaned up eventually, but there is no guarantee when (or even if) that will happen. This is an issue for program correctness when objects are tied to non-memory resources, whose release is an externally-visible program behavior, such as closing a network connection, releasing a device or closing a file. One garbage collection technique which provides determinism in this regard is reference counting.

Garbage collection can have a nondeterministic impact on execution time, by potentialy introducing pauses into the execution of a program which are not correlated with the algorithm being processed. Under tracing garbage collection, the request to allocate a new object can sometimes return quickly and at other times trigger a lengthy garbage collection cycle. Under reference counting, whereas allocation of objects is usually fast, decrementing a reference is nondeterministic, since a reference may reach zero, triggering recursion to decrement the reference counts of other objects which that object holds.

Real-time garbage collection

While garbage collection is generally nondeterministic, it is possible to use it in hard real-time

Real-time computing

systems. Real-time garbage collector should guaranty that even in worst case it will dedicate certain number of computational resources to mutator threads. Constraints imposed on real-time garbage collector are usually either work based or time based. Time based constraint would look like: within each time window of duration T, mutator threads should be allowed to run at least for Tm time. For work based analysis, MMU (minimal mutator utilization) is usually used as real time constraint for garbage collection algorithm.

One of first implementation of real-time

Real-time computing

garbage collection for JVM was work on Metronome algorithm. There are other commercial implementations.

Reference counting

Reference counting is a form of garbage collection whereby where each object has a count of the number of references to it. Garbage is identified by having a reference count of zero. An object's reference count is incremented when a reference to it is created, and decremented when a reference is destroyed. The object's memory is reclaimed when the count reaches zero.

There are two major disadvantages to reference counting:

If two or more objects refer to each other, they can create a cycle whereby neither will be collected as their mutual references never let their reference counts become zero. Some garbage collection systems using reference counting (like the one in CPython
CPython
CPython is the default, most-widely used implementation of the Python programming language. It is written in C. In addition to CPython, there are two other production-quality Python implementations: Jython, written in Java, and IronPython, which is written for the Common Language Runtime. There...

) use specific cycle-detecting algorithms to deal with this issue. Another strategy is to use weak references
Weak reference
In computer programming, a weak reference is a reference that does not protect the referenced object from collection by a garbage collector . An object referenced only by weak references is considered unreachable and so may be collected at any time...

for the "backpointers" which create cycles. Under reference counting, a weak reference is similar to a weak reference under a tracing garbage collector. It is a special reference object whose existence does not increment the reference count of the referent object. Furthermore, a weak reference is safe in that when the referent object becomes garbage, any weak reference to it lapses, rather than being permitted to remain dangling, meaning that it turns into a predictable value, such as a null reference.

In naive implementations, each assignment of a reference and each reference falling out of scope often require modifications of one or more reference counters. However, in the common case, when a reference is copied from an outer scope variable into an inner scope variable, such that the lifetime of the inner variable is bounded by the lifetime of the outer one, the reference incrementing can be eliminated. The outer variable "owns" the reference. In the programming language C++, this technique is readily implemented and demonstrated with the use of const references. Reference counting in C++ is usually implemented using "smart pointers" whose constructors, destructors and assignment operators manage the references. A smart pointer can be passed by reference to a function, which avoids the need to copy-construct a new reference (which would increase the reference count on entry into the function and decrease it one exit). Instead the function receives a reference to the smart pointer which is produced inexpensively.

When used in a multithreaded environment, these modifications (increment and decrement) may need to be atomic operations such as compare-and-swap

Compare-and-swap

In computer science, the compare-and-swap CPU instruction is a special instruction that atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value...

, at least for any objects which are shared, or potentially shard among multiple threads. Atomic operations are expensive on a multiprocessor, and even more expensive if they have to be emulated with software algorithms. There are ways to optimize this situation, such as having two-level references with multiple counts. A reference to R to an object O can actually be instead represented as a reference R to a reference-counted object RO which itself is a reference to O. R and RO are not shared among threads. Different threads have their own nonshared R and RO pointing to the shared O. Each RO owns a single reference count within O. The reference R can be passed around within a thread, such that only the count within RO increments and decrements, without the use of atomic operations. Only when the reference count of RO reaches zero (the thread has lots its last private reference to the object), then the reference count of O is decremented using an atomic operation. A special operation for inter-thread-object passing can instantiate a new R pointing to a new RO. All of this is encapsulated so that it is transparent: R just looks like a reference to O in most situations in the program.

It is a common misconception that reference counting has deterministic performance compared to tracing garbage collection. Reference counting provides quasi-deterministing reclamation performance in situations where objects are obviously short-lived (the same situation in which generational GC techniques also provide quasi-deterministic behavior). In the general case, whenever an object's reference is decremented, there is no telling how much CPU time will be consumed. Each object whose reference count reaches zero (i.e. each object which is garbage) has to recursively trigger a reference count decrement operation on all other objects which it holds, some of which may also reach zero reference counts. Reference counting in the general case will also introduce nondeterministic pauses.

Escape analysis

In programming language compiler optimization theory, escape analysis is a method for determining the dynamic scope of pointers. It is related to pointer analysis and shape analysis....

can be used to convert heap allocations to stack allocations, thus reducing the amount of work needed to be done by the garbage collector.

Availability

Generally speaking, higher-level programming languages

High-level programming language

A high-level programming language is a programming language with strong abstraction from the details of the computer. In comparison to low-level programming languages, it may use natural language elements, be easier to use, or be from the specification of the program, making the process of...

are more likely to have garbage collection as a standard feature. In languages that do not have built in garbage collection, it can often be added through a library, as with the Boehm garbage collector

Boehm garbage collector

In computer science, the Boehm-Demers-Weiser garbage collector, often simply known as Boehm GC, is a conservative garbage collector for C and C++.Boehm GC is free software distributed under a permissive free software licence similar to the X11 license....

for C and C++. This approach is not without drawbacks, such as changing object creation and destruction mechanisms.

Most functional programming languages, such as ML, Haskell

Haskell (programming language)

Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...

, and APL, have garbage collection built in. Lisp, which introduced functional programming

Functional programming

, is especially notable for introducing this mechanism.

Other dynamic languages, such as Ruby

Ruby (programming language)

Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...

(but not Perl

Perl

Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

5, or PHP

PHP

PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

, which use reference counting), also tend to use GC. Object-oriented programming

Object-oriented programming

Object-oriented programming is a programming paradigm using "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction,...

languages such as Smalltalk

Smalltalk

Smalltalk is an object-oriented, dynamically typed, reflective programming language. Smalltalk was created as the language to underpin the "new world" of computing exemplified by "human–computer symbiosis." It was designed and created in part for educational use, more so for constructionist...

, Java

Java (programming language)

and ECMAScript

ECMAScript

ECMAScript is the scripting language standardized by Ecma International in the ECMA-262 specification and ISO/IEC 16262. The language is widely used for client-side scripting on the web, in the form of several well-known dialects such as JavaScript, JScript, and ActionScript.- History :JavaScript...

usually provide integrated garbage collection. Notable exceptions are C++

C++

and Delphi which have destructors

Destructor (computer science)

In object-oriented programming, a destructor is a method which is automatically invoked when the object is destroyed...

. Objective-C

Objective-C

Objective-C is a reflective, object-oriented programming language that adds Smalltalk-style messaging to the C programming language.Today, it is used primarily on Apple's Mac OS X and iOS: two environments derived from the OpenStep standard, though not compliant with it...

has not traditionally had it, but ObjC 2.0 as implemented by Apple for Mac OS X

Mac OS X

Mac OS X is a series of Unix-based operating systems and graphical user interfaces developed, marketed, and sold by Apple Inc. Since 2002, has been included with all new Macintosh computer systems...

uses a runtime collector developed in-house, while the GNUstep

GNUstep

GNUstep is a free software implementation of Cocoa Objective-C libraries , widget toolkit, and application development tools not only for Unix-like operating systems, but also for Microsoft Windows. It is part of the GNU Project.GNUstep features a cross-platform, object-oriented development...

project uses a Boehm collector.

Historically, languages intended for beginners, such as BASIC

BASIC

BASIC is a family of general-purpose, high-level programming languages whose design philosophy emphasizes ease of use - the name is an acronym from Beginner's All-purpose Symbolic Instruction Code....

and Logo

Logo (programming language)

Logo is a multi-paradigm computer programming language used in education. It is an adaptation and dialect of the Lisp language; some have called it Lisp without the parentheses. It was originally conceived and written as functional programming language, and drove a mechanical turtle as an output...

, have often used garbage collection for heap-allocated variable-length data types, such as strings and lists, so as not to burden programmers with manual memory management. On early microcomputers, with their limited memory and slow processors, BASIC garbage collection could often cause apparently random, inexplicable pauses in the midst of program operation. Some BASIC interpreters such as Applesoft BASIC on the Apple II family, had terribly inefficient garbage collectors for strings which repeatedly scanned the string descriptors for the string having the highest address in order to compact it toward high memory. This one-string-at-a-time processing loop resulted in O(N*N) time performance in the number of strings, which would introduce a pause more than one minute long into the execution of string-intensive programs. A replacement garbage collector for Applesoft BASIC published in Call-A.P.P.L.E.

Call-A.P.P.L.E.

Call-A.P.P.L.E. Magazine is the monthly journal publication of the Apple Pugetsound Program Library Exchange . The magazine was published from 1978 until 1990 when it was discontinued; after a 12 year lapse publication was restarted in 2002...

(January 1981, pages 40-45, Randy Wiggington) identified a group of strings in every pass over the heap, reducing a pause of two minutes into less than a second depending on the size of the group. Other approaches were published, but none ever made it into a new revision of the BASIC interpreter.

Limited environments

Garbage collection is rarely used on embedded or real-time systems because of the perceived need for very tight control over the use of limited resources. However, garbage collectors compatible with such limited environments have been developed. The Microsoft .NET Micro Framework

.NET Micro Framework

The .NET Micro Framework is an Open Source .NET platform for resource-constrained devices with at least 256 KBytes of flash and 64 KBytes of RAM. It includes a small version of the .NET CLR and supports development in C#, Visual Basic .NET, and debugging using Microsoft Visual Studio...

and Java Platform, Micro Edition

Java Platform, Micro Edition

Java Platform, Micro Edition, or Java ME, is a Java platform designed for embedded systems . Target devices range from industrial controls to mobile phones and set-top boxes...

are embedded software platforms that, like their larger cousins, include garbage collection.

External links

The Garbage Collection Handbook: The Art of Automatic Memory Management
Richard Jones' Garbage Collection Page at the University of Kent
University of Kent
The University of Kent, previously the University of Kent at Canterbury, is a public research university based in Kent, United Kingdom...
The Memory Management Reference
The Very Basics of Garbage Collection, by Thomas Padron-McCarthy
Publications by the OOPS group at the University of Texas at Austin
University of Texas at Austin
The University of Texas at Austin is a state research university located in Austin, Texas, USA, and is the flagship institution of the The University of Texas System. Founded in 1883, its campus is located approximately from the Texas State Capitol in Austin...
A garbage collector for C and C++ by Hans Boehm
Expensive Explicit Deallocation: An Example by Hans Boehm
On-the-fly garbage collection: an exercise in cooperation by Edsger W. Dijkstra and Leslie Lamport
Leslie Lamport
Leslie Lamport is an American computer scientist. A graduate of the Bronx High School of Science, he received a B.S. in mathematics from the Massachusetts Institute of Technology in 1960, and M.A. and Ph.D. degrees in mathematics from Brandeis University, respectively in 1963 and 1972...

and A.J.Martin and C.S.Scholten and E.F.M.Steffens
A Garbage Collection Course Curriculum at MSDN Academic Alliance
A Glance at Garbage Collection in Object-Oriented Languages
Notes on the CLR Garbage Collector
Understanding GC pauses in HotSpot JVM

Implementations

A Real-Time Garbage Collector Based on the Lifetimes of Objects by H. Lieberman and C. Hewitt, MIT Artificial Intelligence Laboratory
TinyGC - an independent implementation of the BoehmGC API
Conservative Garbage Collection Implementation for C Language by Yasin Hınıslıoğlu
MeixnerGC - an incremental mark and sweep garbage collector for C++ using smart pointers

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.