Inline caching

Inline caching is an optimization technique

Compiler optimization

Compiler optimization is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied...

employed by some language runtimes

Run-time system

A run-time system is a software component designed to support the execution of computer programs written in some computer language...

, and first developed for Smalltalk

Smalltalk

Smalltalk is an object-oriented, dynamically typed, reflective programming language. Smalltalk was created as the language to underpin the "new world" of computing exemplified by "human–computer symbiosis." It was designed and created in part for educational use, more so for constructionist...

.
The goal of inline caching is to speed up runtime method binding

Dynamic dispatch

In computer science, dynamic dispatch is the process of mapping a message to a specific sequence of code at runtime. This is done to support the cases where the appropriate method can't be determined at compile-time...

by remembering the results of a previous method lookup directly at the call site

Call site

In programming, a call site of a function is a line in the code which calls a function. A call site passes zero or more arguments to the function, and receives zero or more return values.-Example:...

. Inline caching is especially useful for dynamically typed languages where most if not all method binding happens at runtime and where virtual method table

Virtual method table

A virtual method table, virtual function table, dispatch table, or vtable, is a mechanism used in a programming language to support dynamic dispatch ....

s often cannot be used.

Runtime method binding

The following ECMAScript

ECMAScript

ECMAScript is the scripting language standardized by Ecma International in the ECMA-262 specification and ISO/IEC 16262. The language is widely used for client-side scripting on the web, in the form of several well-known dialects such as JavaScript, JScript, and ActionScript.- History :JavaScript...

function receives an object, invokes its toString-method and displays the results on the page the script is embedded in.

function dump(obj) {
document.write(obj.toString);
}

Since the type of the object is not specified and because of potential method overloading

Method overloading

Function overloading or method overloading is a feature found in various programming languages such as Ada, C#, VB.NET, C++, D and Java that allows the creation of several methods with the same name which differ from each other in terms of the type of the input and the type of the output of the...

, it is impossible to decide ahead of time, which concrete implementation of the toString-method is going to be invoked. Instead, a dynamic lookup has to be performed at runtime. In language runtimes that do not employ some form of caching, this lookup is performed every time a method is invoked. Because methods may be defined several steps down the inheritance chain

Inheritance (computer science)

In object-oriented programming , inheritance is a way to reuse code of existing objects, establish a subtype from an existing object, or both, depending upon programming language support...

, a dynamic lookup can be an expensive operation.

To achieve better performance, many language runtimes employ some form of non-inline caching where the results of a limited number of method lookups are stored in an associative data structure. This can greatly increase performance, provided that the programs executed are "cache friendly" (i.e. there is a limited set of methods that is invoked frequently). This data structure is typically called the first-level method lookup cache .

The concept of inline caching is based on the empirical observation that the objects that occur at a particular call site are often of the same type. In those cases, performance can be increased greatly by storing the result of a method lookup "inline", i.e. directly at the call site. To facilitate this process, call sites are assigned different states. Initially, a call site is considered to be "uninitialized". Once the language runtime reaches a particular uninitialized call site, it performs the dynamic lookup, stores the result at the call site and changes its state to "monomorphic". If the language runtime reaches the same call site again, it retrieves the callee

Called party

The called party is a person who answers a telephone call. The person who initiates a telephone call is the calling party....

from it and invokes it directly without performing any more lookups. To account for the possibility that objects of different types may occur at the same call site, the language runtime also has to insert guard conditions

Guard (computing)

In computer programming, a guard is a boolean expression that must evaluate to true if the program execution is to continue in the branch in question. The term is used at least in Haskell, Clean, Erlang, occam, Promela, OCaml and Scala programming languages. In Mathematica, guards are called...

into the code. Most commonly, these are inserted into the preamble of the callee rather than at the call site to better exploit branch prediction

Branch predictor

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline...

and to save space due to one copy in the preamble versus multiple copies at each call site. If a call site that is in the "monomorphic" state encounters a type other than the one it expects, it has to change back to the "uninitialized" state and perform a full dynamic lookup again.

The canonical implementation is a register load of a constant followed by a call instruction. The "uninitialized" state is better called "unlinked". The register is loaded with the message selector (typically the address of some object) and the call is to the run-time routine that will look-up the message in the class of the current receiver, using the first-level method lookup cache above. The run-time routine then rewrites the instructions, changing the load instruction to load the register with the type of the current receiver, and the call instruction to call the preamble of the target method, now "linking" the call site to the target method. Execution then continues immediately following the preamble. A subsequent execution will call the preamble directly. The preamble then derives the type of the current receiver and compares it with that in the register; if they agree the receiver is of the same type and the method continues to execute. If not, the preamble again calls the run-time and various strategies are possible, one being to relink the call-site for the new receiver type.

The performance gains come from having to do one type comparison, instead of at least a type comparison and a selector comparison for the first-level method lookup cache, and from using a direct call (which will benefit from instruction prefetch and pipe-lining) as opposed to the indirect call in a method-lookup or a vtable dispatch.

Monomorphic inline caching

If a particular call site frequently sees different types of objects, the performance benefits of inline caching can easily be nullified by the overhead induced by the frequent changes in state of the call site. The following example constitutes a worst case scenario for monomorphic inline caching:

var values = [1, "a", 2, "b", 3, "c", 4, "d"];
for (var i in values) {
document.write(values[i].toString);
}

Again, the method toString is invoked on an object whose type is not known in advance. More importantly though, the type of the object changes with every iteration of the surrounding loop. A naive implementation of monomorphic inline caching would therefore constantly cycle through the "uninitialized" and "monomorphic" states. In order to prevent this from happening, most implementations of monomorphic inline caching support a third state often referred to as the "megamorphic" state. This state is entered when a particular call site has seen a predetermined number of different types. Once a call site has entered the "megamorphic" state, it will behave just as it did in the "uninitialized" state with the exception that it will not enter the "monomorphic" state ever again (some implementations of monomorphic inline caching will change "megamorphic" call sites back to being "uninitialized" after a certain amount of time has passed or once a full garbage collection

Garbage collection (computer science)

In computer science, garbage collection is a form of automatic memory management. The garbage collector, or just collector, attempts to reclaim garbage, or memory occupied by objects that are no longer in use by the program...

cycle is performed).

Polymorphic inline caching

To better deal with call sites that frequently see a limited number of different types, some language runtimes employ a technique called polymorphic inline caching . With polymorphic inline caching, once a call site that is in its "monomorphic" state sees its second type, rather than reverting to the "uninitialized" state it switches to a new state called "polymorphic". A "polymorphic" call site decides which of a limited set of known methods to invoke based on the type that it is currently presented with. In other words, with polymorphic inline caching, multiple method lookup results can be recorded at the same call site. Because every call site in a program can potentially see every type in the system, there usually is an upper bound to how many lookup results are recorded at each call site. Once that upper bound is reached, call sites become "megamorphic" and no more inline caching is performed.

The canonical implementation is a jump table which consists of a preamble that derives the type of the receiver and a series of constant compares and conditional jumps that jump to the code following the preamble in the relevant method for each receiver type. The jump table is typically allocated for a particular call-site when a monomorphic call-site encounters a different type. The jump-table will have a fixed size and be able to grow, adding cases as new types are encountered up to some small maximum number of cases such as 4, 6 or 8. Once it reaches its maximum size execution for a new receiver type will "fall-off" the end and enter the run-time, typically to perform a method lookup starting with the first-level method cache.

The observation that together, monomorphic and polymorphic inline caches collect per-call-site receiver type information as a side-effect of optimizing program execution led to the development of adaptive optimization

Adaptive optimization

Adaptive optimization is a technique in computer science that performs dynamic recompilation of portions of a program based on the current execution profile. With a simple implementation, an adaptive optimizer may simply make a trade-off between Just-in-time compilation and interpreting instructions...

in Self, where the run-time optimizes "hot spots" in the program using the type information in inline caches to guide speculative inlining decisions.

Megamorphic inline caching

If a run-time uses both monomorphic and polymorphic inline caching then in the steady state the only unlinked sends occurring will be those from sends falling-off the ends of polymorphic inline caches. Since such sends are slow it can now be profitable to optimize these sites. A megamorphic inline cache can be implemented by creating code to perform a first-level method lookup for a particular call-site. In this scheme once a send falls-off the end of a polymorphic inline cache a megamorphic cache specific to the call site's selector is created (or shared if one already exists), and the send site is relinked to call it. The code can be significantly more efficient than a normal first-level method lookup probe since the selector is now a constant, which decreases register pressure, the code for the lookup and dispatch is executed without calling into the run-time, and the dispatch can benefit from branch prediction

Branch predictor

.

Empirical measurements show that in large Smalltalk programs about 1/3 of all send sites in active methods remain unlinked, and of the remaining 2/3, 90% are monomorphic, 9% polymorphic and 1% (0.9%) are megamorphic.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.