Loop unwinding
Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size (a space-time tradeoff). The transformation can be undertaken manually by the programmer or by an optimizing compiler.

The goal of loop unwinding is to increase a program's speed by reducing (or eliminating) instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration; reducing branch penalties; as well as "hiding latencies, in particular, the delay in reading data from memory". Loops can instead be re-written as a repeated sequence of similar independent statements, eliminating this overhead.

Advantages

The overhead in "tight" loops often consists of instructions to increment a pointer or index to the next element in an array (pointer arithmetic), as well as "end of loop" tests. If an optimizing compiler or assembler is able to pre-calculate offsets to each individually referenced array variable, these can be built into the machine code instructions directly, therefore requiring no additional arithmetic operations at run time (note that in the example given below this is not the case).
  • Significant gains can be realized if the reduction in executed instructions more than compensates for any performance reduction caused by any increase in the size of the program.
  • Branch penalty is minimised.
  • If the statements in the loop are independent of each other (i.e. where statements that occur earlier in the loop do not affect statements that follow them), the statements can potentially be executed in parallel (a sketch appears below).
  • Unrolling can be implemented dynamically if the number of array elements is unknown at compile time (see Dynamic unrolling below).


Optimizing compilers will sometimes perform the unrolling automatically or upon request.
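
For instance, the following minimal sketch (an illustration, not part of the original text) unrolls a summation by four with one accumulator per statement; because the partial sums s0 through s3 are independent, the additions in each iteration can overlap in the processor pipeline. The function sum4 and the assumption that n is a multiple of 4 are invented for the sketch:

#include <stddef.h>

/* Sum n doubles using four independent accumulators (assumes n % 4 == 0). */
double sum4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i < n; i += 4)
    {
        s0 += a[i];     /* these four additions have no mutual  */
        s1 += a[i + 1]; /* dependencies, so the hardware can    */
        s2 += a[i + 2]; /* schedule them in parallel            */
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}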

Disadvantages

  • The program code size increases after unrolling, which is undesirable, particularly for embedded applications.
  • Increased code size can also cause an increase in instruction cache misses, which may adversely affect performance.
  • Unless performed transparently by an optimizing compiler, the code may become less readable.
  • If the code in the body of the loop involves function calls, it may not be possible to combine unrolling with inlining, since the increase in code size might be excessive. Thus there can be a tradeoff between the two optimizations.
  • Possible increased register usage in a single iteration to store temporary variables, which may reduce performance (though much will depend on possible optimizations).

Static/manual loop unrolling

Manual (or static) loop unrolling involves the programmer analyzing the loop and rewriting the iterations as a sequence of instructions that reduces the loop overhead. This is in contrast to dynamic unrolling, which is accomplished by the compiler.

A simple manual example in C

A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished by means of a for-loop which calls the function delete(item_number). If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to those for the delete(x) function itself, loop unwinding can be used to speed it up.
Normal loop:

int x;
for (x = 0; x < 100; x++)
{
    delete(x);
}

After loop unrolling:

int x;
for (x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}

As a result of this modification, the new program has to make only 20 iterations instead of 100. Only 20% of the jumps and conditional branches are now taken, which, over many iterations, represents a potentially significant decrease in the loop administration overhead. To produce the optimal benefit, no variables should be specified in the unrolled code that require pointer arithmetic. This usually requires "base plus offset" addressing, rather than indexed referencing.

On the other hand, this manual loop unrolling expands the source code size from 3 lines to 7, which have to be produced, checked, and debugged, and the compiler may have to allocate more registers to store variables in the expanded loop iteration. In addition, the loop control variables and the number of operations inside the unrolled loop structure have to be chosen carefully so that the result is indeed the same as in the original code (assuming this is a later optimization on already working code). For example, consider the implications if the iteration count were not divisible by 5 (a common remedy is sketched below). The manual amendments required also become somewhat more complicated if the test conditions are variables. See also Duff's device.
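
As an illustrative sketch, not part of the original example, the unrolled loop can be paired with a short "cleanup" loop that handles the leftover iterations; delete(item_number) is the hypothetical routine from above, and the wrapper delete_all and its parameter n are invented for the sketch:

extern void delete(int item_number); /* hypothetical routine from the example above */

void delete_all(int n) /* n may or may not be a multiple of 5 */
{
    int x;
    /* main unrolled loop: runs while at least five items remain */
    for (x = 0; x + 5 <= n; x += 5)
    {
        delete(x);
        delete(x + 1);
        delete(x + 2);
        delete(x + 3);
        delete(x + 4);
    }
    /* cleanup loop: handles the up-to-four leftover items */
    for (; x < n; x++)
        delete(x);
}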

Early complexity

In the simple case, the loop control is merely an administrative overhead that arranges the productive statements. The loop itself contributes nothing to the results desired, merely saving the programmer the tedium of replicating the code a hundred times, which could have been done by a pre-processor generating the replications, or by a text editor. Similarly, if-statements and other flow control statements could be replaced by code replication, except that code bloat can be the result. Computer programs easily track the combinations, but programmers find this repetition boring and make mistakes.
Consider:
Normal loop:

for i := 1:8 do
    if i mod 2 = 0 then do_evenstuff(i)
    else do_oddstuff(i);
next i;

After loop unrolling:

do_oddstuff(1); do_evenstuff(2);
do_oddstuff(3); do_evenstuff(4);
do_oddstuff(5); do_evenstuff(6);
do_oddstuff(7); do_evenstuff(8);


But of course, the code performed need not be the invocation of a procedure, and this next example involves the index variable in computation:
Normal loop:

x(1) := 1;
for i := 2:9 do
    x(i) := x(i - 1)*i;
    print i, x(i);
next i;

After loop unrolling:

x(1) := 1;
x(2) := x(1)*2; print 2, x(2);
x(3) := x(2)*3; print 3, x(3);
...etc.


which, if compiled, might produce a lot of code (print statements being notorious), but further optimization is possible. This example makes reference only to x(i) and x(i - 1) in the loop (the latter only to develop the new value x(i)); therefore, given that there is no later reference to the array x developed here, its usages could be replaced by a simple variable. Such a change would, however, mean a simple variable whose value is changed on each pass, whereas if staying with the array, the compiler's analysis might note that the array's values are constant, each derived from a previous constant, and therefore carry the constant values forward so that the code becomes
print 2,2;
print 3,6;
print 4,24;
...etc.
It would be quite a surprise if the compiler were to recognise x(n) = Factorial(n).

In general, the content of a loop might be large, involving intricate array indexing. These cases are probably best left to optimizing compilers to unroll. Replicating innermost loops might allow many possible optimisations yet yield only a small gain unless n is large.

Unrolling WHILE loops

A pseudocode WHILE loop, similar to the following:

Normal loop:

WHILE (condition) DO
    action
ENDWHILE

After loop unrolling:

WHILE (condition) DO
    action
    IF NOT(condition) THEN GOTO break
    action
    IF NOT(condition) THEN GOTO break
    action
ENDWHILE
LABEL break:

Unrolled & "tweaked" loop:

IF (condition) THEN
    REPEAT {
        action
        IF NOT(condition) THEN GOTO break
        action
        IF NOT(condition) THEN GOTO break
        action
    } WHILE (condition)
LABEL break:

Unrolling is faster because the ENDWHILE (which will be compiled to a jump to the start of the loop) will be executed 66% less often.

Even better, the "tweaked" pseudocode example, a transformation that may be performed automatically by some optimizing compilers, eliminates unconditional jumps altogether.
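
As a rough C rendering of the "tweaked" version (condition and action are stand-ins for the pseudocode placeholders; the count-down driver state is invented so the sketch is runnable):

#include <stdio.h>

static int n = 10; /* invented driver state for the sketch */
static int condition(void) { return n > 0; }
static void action(void)   { printf("n = %d\n", n--); }

int main(void)
{
    /* one initial test, then a bottom-tested loop whose body performs */
    /* three actions with early exits between them                    */
    if (condition()) {
        do {
            action();
            if (!condition()) break;
            action();
            if (!condition()) break;
            action();
        } while (condition());
    }
    return 0;
}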

Dynamic unrolling

Since the benefits of loop unrolling are frequently dependent on the size of an array, which may often not be known until run time, JIT compilers (for example) can determine whether to invoke a "standard" loop sequence or instead generate a (relatively short) sequence of individual instructions for each element. This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling. In this situation, it is often with relatively small values of 'n' that the savings are still useful, requiring quite small (if any) overall increase in program size (the generated code might be included just once, as part of a standard library).

Assembly language programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient branch tables. Here the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded). The example below is for IBM/360 or z/Architecture assemblers and assumes a field of 100 bytes (at offset zero) is to be copied from array FROM to array TO, both having element lengths of 256 bytes with 50 entries.

Assembler example (IBM/360 or z/Architecture)

For an x86 example, see External links below.


* initialize all the registers to point to arrays etc. (R14 is the return address)
         LM    R15,R2,INIT               load R15='16', R0=number in array, R1-->'FROM' array, R2-->'TO' array
LOOP     EQU   *
         SR    R15,R0                    get 16 minus the number in the array
         BNP   ALL                       if n >= 16, need to do all of the sequence, then repeat
* (if the number of entries is zero, R15 will still be 16, so all the MVCs will be bypassed)
* calculate an offset (from the start of the MVC sequence) for the unconditional branch into the 'unwound' MVC loop
         MH    R15,=AL2(ILEN)            multiply by the length of one MVC instruction (=6 in this example)
         B     ALL(R15)                  indexed branch instruction (to an MVC, with drop-through)
* Assignment (move) table - (the first entry has the maximum allowable offset with a single register = X'F00' in this example)
ALL      MVC   15*256(100,R2),15*256(R1)   * move 100 bytes of the 16th entry from array 1 to array 2 (with drop-through)
ILEN     EQU   *-ALL                       length of one MVC instruction; in this case =6
         MVC   14*256(100,R2),14*256(R1)   *
         MVC   13*256(100,R2),13*256(R1)   *
         MVC   12*256(100,R2),12*256(R1)   * All 16 of these 'move character' instructions use base plus offset addressing,
         MVC   11*256(100,R2),11*256(R1)   * and each to/from offset decreases by the length of one array element (256).
         MVC   10*256(100,R2),10*256(R1)   * This avoids pointer arithmetic being required for each element, up to a
         MVC   09*256(100,R2),09*256(R1)   * maximum permissible offset within the instruction of X'FFF'. The instructions
         MVC   08*256(100,R2),08*256(R1)   * are in order of decreasing offset, so the first element in the set is moved
         MVC   07*256(100,R2),07*256(R1)   * last.
         MVC   06*256(100,R2),06*256(R1)   *
         MVC   05*256(100,R2),05*256(R1)   *
         MVC   04*256(100,R2),04*256(R1)   *
         MVC   03*256(100,R2),03*256(R1)   *
         MVC   02*256(100,R2),02*256(R1)   *
         MVC   01*256(100,R2),01*256(R1)   move 100 bytes of the 2nd entry
         MVC   00*256(100,R2),00*256(R1)   move 100 bytes of the 1st entry
*
         S     R0,MAXM1                  reduce the count of remaining entries to process
         BNPR  R14                       ... no more, so return to the address in R14
         AH    R1,=AL2(16*256)           increment the 'FROM' register pointer beyond the first set
         AH    R2,=AL2(16*256)           increment the 'TO' register pointer beyond the first set
         L     R15,MAXM1                 re-load the maximum number of MVCs into R15 (destroyed by the calculation earlier)
         B     LOOP                      go and execute the loop again
*
* ----- Define static constants and variables (these could be passed as parameters) --------------------------------- *
INIT     DS    0A                        4 addresses (pointers) to be pre-loaded with an 'LM' instruction
MAXM1    DC    A(16)                     maximum number of MVCs per pass
N        DC    A(50)                     number of actual entries in the array (a variable, set elsewhere)
         DC    A(FROM)                   address of the start of array 1 ("pointer")
         DC    A(TO)                     address of the start of array 2 ("pointer")
* ----- Define static arrays (these could be dynamically acquired) -------------------------------------------------- *
FROM     DS    50CL256                   array of (max) 50 entries of 256 bytes each
TO       DS    50CL256                   array of (max) 50 entries of 256 bytes each

In this example, approximately 202 instructions would be required with a 'conventional' loop (50 iterations), whereas the above dynamic code would require only about 89 instructions (a saving of approximately 56%). If the array had consisted of only 2 entries, it would still execute in approximately the same time as the original unwound loop. The increase in code size is only about 108 bytes, even if there are thousands of entries in the array. Similar techniques can of course be used where multiple instructions are involved, as long as the combined instruction length is adjusted accordingly. For example, in this same example, if it is required to clear the rest of each array entry to nulls immediately after the 100-byte field is copied, an additional clear instruction, 'XC xx*256+100(156,R2),xx*256+100(R2)', can be added immediately after every MVC in the sequence (where 'xx' matches the value in the MVC above it).

It is, of course, perfectly possible to generate the above code 'inline' using a single assembler macro statement, specifying just 4 or 5 operands (or alternatively, make it into a library subroutine, accessed by a simple call, passing a list of parameters), making the optimization readily accessible to inexperienced programmers.
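
The same trick of jumping into the middle of an unrolled body can be expressed in C as Duff's device; the following is a sketch of the standard construction (a copying variant, valid only for count > 0), not code from the original article:

/* Duff's device: a switch fall-through jumps into the middle of an */
/* unrolled do-while, so any positive count is handled without a    */
/* separate cleanup loop.                                           */
void copy_bytes(char *to, const char *from, int count) /* count > 0 */
{
    int n = (count + 7) / 8;   /* number of do-while passes */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}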

C example

The following example demonstrates dynamic loop unrolling for a simple program written in C. Unlike the assembler example above, pointer/index arithmetic is still generated by the compiler in this example because a variable (i) is still used to address the array element. Full optimization is only possible if absolute indexes are used in the replacement statements (a short sketch of this appears after the example).
#include <stdio.h>

#define TOGETHER (8)


int main(void)
{
    int i = 0;
    int entries = 50; /* total number to process     */
    int repeat;       /* number of times for while.. */
    int left = 0;     /* remainder (process later)   */

    /* If the number of elements is not divisible by TOGETHER,      */
    /* get the repeat count that does most processing in the while loop */
    repeat = (entries / TOGETHER); /* number of times to repeat */
    left   = (entries % TOGETHER); /* calculate remainder       */

    /* Unroll the loop in 'bunches' of 8 */
    while (repeat-- > 0)
    {
        printf("process(%d)\n", i);
        printf("process(%d)\n", i + 1);
        printf("process(%d)\n", i + 2);
        printf("process(%d)\n", i + 3);
        printf("process(%d)\n", i + 4);
        printf("process(%d)\n", i + 5);
        printf("process(%d)\n", i + 6);
        printf("process(%d)\n", i + 7);

        /* update the index by the amount processed in one go */
        i += TOGETHER;
    }

    /* Use a switch statement to process the remainder: execution jumps */
    /* to the matching case label and falls through to complete the set */
    switch (left)
    {
        case 7 : printf("process(%d)\n", i + 6); /* process and rely on fall-through */
        case 6 : printf("process(%d)\n", i + 5);
        case 5 : printf("process(%d)\n", i + 4);
        case 4 : printf("process(%d)\n", i + 3);
        case 3 : printf("process(%d)\n", i + 2);
        case 2 : printf("process(%d)\n", i + 1); /* two left */
        case 1 : printf("process(%d)\n", i);     /* just one left to process */
        case 0 : ;                               /* none left */
    }
    return 0;
}
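
For comparison, if the count were fixed at compile time, the replacement statements could use absolute indexes, removing the index variable, and with it the residual index arithmetic, entirely. A deliberately tiny sketch (invented, with the count reduced to 4):

#include <stdio.h>

int main(void)
{
    /* no loop variable remains, so no index arithmetic is generated */
    printf("process(%d)\n", 0);
    printf("process(%d)\n", 1);
    printf("process(%d)\n", 2);
    printf("process(%d)\n", 3);
    return 0;
}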


See also

  • Just-in-time compilation
  • Loop splitting
  • Loop fusion
  • Duff's device
  • Instruction level parallelism
  • Parallel computing


External links

  • Chapter 7, pages 8 to 10, of Michael Abrash's Graphics Programming Black Book is about loop unrolling, with an example in x86 assembly.
  • Generalized Loop Unrolling (http://www.cs.uh.edu/~jhuang/JCH/JC/loop.pdf), a concise introduction.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 