Strength reduction - AbsoluteAstronomy.com

Strength reduction is a compiler optimization

Compiler optimization

Compiler optimization is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied...

where expensive operations are replaced with equivalent but less expensive operations. The classic example of strength reduction converts "strong" multiplications inside a loop into "weaker" additions – something that frequently occurs in array addressing.

Examples of strength reduction include:

replacing a multiplication within a loop with an addition
replacing an exponentiation within a loop with a multiplication

Code analysis

Most of a program's execution time is typically spent in a small section of code, and that code is often inside a loop that is executed over and over.

A compiler uses methods to identify loops and recognize the characteristics of register values within those loops. For strength reduction, the compiler is interested in

loop invariants. These are values that do not change within the body of a loop.
induction variables. These are the values that are being iterated each time through the loop.

Loop invariants are essentially constants within a loop, but their value may change outside of the loop. Induction variables are changing by known amounts. The terms are relative to a particular loop. When loops are nested, an induction variable in the outer loop can be a loop invariant in the inner loop.

Strength reduction looks for expressions involving a loop invariant and an induction variable. Some of those expressions can be simplified. For example, the multiplication of loop invariant c and induction variable i



c = 8;

for (i = 0; i < N; i++)

  {

    y[i] = c * i;

  }

can be replaced with successive weaker additions



c = 8;

k = 0;

for (i = 0; i < N; i++)

  {

    y[i] = k;

    k = k + c;

  }

Strength reduction example

Below is an example that will strength reduce all the loop multiplications that arose from array indexing address calculations.

Imagine a simple loop that sets an array to the identity matrix.



for (i = 0; i < n; i++)

  {

    for (j = 0; j < n; j++)

      {

         A[i,j] = 0.0;

      }

    A[i,i] = 1.0;

  }

Intermediate Code

The compiler will view this code as



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0040  G0000:

0050      load r2, n                     // i < n

0060      cmp r1, r2

0070      bge G0001

0080

0090      // for (j = 0; j < n; j++)

0100        {

0110           r3   = #0                 // j = 0

0120  G0002:

0130           load r4, n                // j < n

0140           cmp r3, r4

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0180           load r7, n

0190           r8  = r1 * r7             // calculate subscript i * n + j

0200           r9  = r8 + r3

0210           r10 = r9 * #8             // calculate byte address

0220           fr3 = #0.0

0230           fstore fr3, A[r10]

0240

0250           r3 = r3 + #1              // j++

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0300      load r12, n                    // calculate subscript i * n + i

0310      r13 = r1 * r12            

0320      r14 = r13 + r1

0330      r15 = r14 * #8                 // calculate byte address

0340      fr4 = #1.0

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1 = r1 + #1

0390      br G0000

0400    }

0410  G0001:

Many optimizations

The compiler will start doing many optimizations – not just strength reduction. Expressions that are constant (invariant) within a loop will be hoisted

Loop-invariant code motion

In computer programming, loop-invariant code consists of statements or expressions which can be moved outside the body of a loop without affecting the semantics of the program...

out of the loop. Constants can be loaded outside of both loops—such as floating point registers fr3 and fr4. Recognition that some variables don't change allow registers to be merged; n is constant, so r2, r4, r7, r12 can be hoisted and collapsed. The common value i*n is computed in (the hoisted) r8 and r13, so they collapse. The innermost loop (0120-0260) has been reduced from 11 to 7 intermediate instructions. The only multiply that remains in the innermost loop is line 0210's multiply by 8.



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0050      load r2, n

0130 //   load r4, n     killed; use r2

0180 //   load r7, n     killed; use r2

0300 //   load r12, n    killed; use r2

0220      fr3 = #0.0

0340      fr4 = #1.0

0040  G0000:

0060      cmp r1, r2                     // i < n

0070      bge G0001

0080

0190      r8  = r1 * r2                  // calculate subscript i * n

0310 //   r13 = r1 * r2  killed; use r8  // calculate subscript i * n

0090      // for (j = 0; j < n; j++)

0100        {

0110           r3   = #0                 // j = 0

0120  G0002:

0140           cmp r3, r2                // j < n

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0200           r9  = r8 + r3             // calculate subscript i * n + j

0210           r10 = r9 * #8             // calculate byte address

0230           fstore fr3, A[r10]

0240

0250           r3 = r3 + #1              // j++

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0320      r14 = r8  + r1                 // calculate subscript i * n + i

0330      r15 = r14 * #8                 // calculate byte address

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1 = r1 + #1

0390      br G0000

0400    }

0410  G0001:

There are more optimizations to go. Register r3 is the main variable in the innermost loop (0140-0260); it gets incremented by 1 each time through the loop. Register r8 (which is invariant in the innermost loop) is added to r3. Instead of using r3, the compiler can eliminate r3 and use r9. The loop, instead of being controlled by r3 = 0 to n-1, it can be controlled by r9=r8+0 to r8+n-1. That adds four instructions and kills four instructions, but there's one fewer instructions inside the loop



0110  //       r3   = #0     killed      // j = 0

0115           r9   = r8                 // new assignment

0117           r20  = r8 + r2            // new limit

0120  G0002:

0140  //       cmp r3, r2    killed      // j < n

0145           cmp r9, r20               // r8 + j < r8 + n = r20

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0200  //       r9  = r8 + r3 killed      // calculate subscript i * n + j

0210           r10 = r9 * #8             // calculate byte address

0230           fstore fr3, A[r10]

0240

0250  //       r3 = r3 + #1  killed      // j++

0255           r9 = r9 + #1              // new loop variable

0260           br G0002

Now r9 is the loop variable, but it interacts with the multiply by 8. Here we get to do some strength reduction. The multiply by 8 can be reduced to some successive additions of 8. Now there are no multiplications inside the loop.



0115           r9   = r8                 // new assignment

0117           r20  = r8 + r2            // new limit

0118           r10  = r8 * #8            // initial value of r10

0120  G0002:

0145           cmp r9, r20               // r8 + j < r8 + n = r20

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0210  //       r10 = r9 * #8    killed   // calculate byte address

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0255           r9 = r9 + #1              // loop variable

0260           br G0002

Registers r9 and r10 (= 8*r9) aren't both needed; r9 can be eliminated in the loop. The loop is now 5 instructions.



0115  //       r9   = r8         killed

0117           r20  = r8 + r2            // limit

0118           r10  = r8 * #8            // initial value of r10

0119           r22  = r20 * #8           // new limit

0120  G0002:

0145  //       cmp r9, r20       killed  // r8 + j < r8 + n = r20

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0255  //       r9 = r9 + #1      killed  // loop variable

0260           br G0002

Outer loop

Back to the whole picture:



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0050      load r2, n

0220      fr3 = #0.0

0340      fr4 = #1.0

0040  G0000:

0060      cmp r1, r2                     // i < n

0070      bge G0001

0080

0190      r8   = r1 * r2                 // calculate subscript i * n

0117      r20  = r8 + r2                 // limit

0118      r10  = r8 * #8                 // initial value of r10

0119      r22  = r20 * #8                // new limit

0090      // for (j = 0; j < n; j++)

0100        {

0120  G0002:

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0320      r14 = r8  + r1                 // calculate subscript i * n + i

0330      r15 = r14 * #8                 // calculate byte address

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1 = r1 + #1

0390      br G0000

0400    }

0410  G0001:

There are now four multiplications within the outer loop that increments r1. Register r8 = r1*r2 at 0190 can be strength reduced by setting it before entering the loop (0055) and incrementing it by r2 at the bottom of the loop (0385).

The value r8*8 (at 0118) can strength reduced by initializing it (0056) and adding 8*r2 to it when r8 gets incremented (0386).

Register r20 is being incremented by the invariant/constant r2 each time through the loop at 0117. After being incremented, it is multiplied by 8 to create r22 at 0119. That multiplication can be strength reduced by adding 8*r2 each time through the loop.



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0050      load r2, n

0220      fr3 = #0.0

0340      fr4 = #1.0

0055      r8  = r1 * r2                  // set initial value for r8

0056      r40 = r8 * #8                  // initial value for r8 * 8

0057      r30 = r2 * #8                  // increment for r40

0058      r20 = r8 + r2                  // copied from 0117

0058      r22 = r20 * #8                 // initial value of r22

0040  G0000:

0060      cmp r1, r2                     // i < n

0070      bge G0001

0080

0190  //  r8   = r1 * r2    killed       // calculate subscript i * n

0117  //  r20  = r8 + r2    killed - dead code

0118      r10  = r40                     // strength reduced expression to r40

0119  //  r22  = r20 * #8   killed       // strength reduced

0090      // for (j = 0; j < n; j++)

0100        {

0120  G0002:

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0320      r14 = r8  + r1                 // calculate subscript i * n + i

0330      r15 = r14 * #8                 // calculate byte address

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1  = r1 + #1

0385      r8  = r8 + r2                  // strength reduce r8 = r1 * r2

0386      r40 = r40 + r30                // strength reduce expression r8 * 8

0388      r22 = r22 + r30                // strength reduce r22 = r20 * 8

0390      br G0000

0400    }

0410  G0001:

The last multiply

That leaves the two loops with only one multiplication operation (at 0330) within the outer loop and no multiplications within the inner loop.



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0050      load r2, n

0220      fr3 = #0.0

0340      fr4 = #1.0

0055      r8  = r1 * r2                  // set initial value for r8

0056      r40 = r8 * #8                  // initial value for r8 * 8

0057      r30 = r2 * #8                  // increment for r40

0058      r20 = r8 + r2                  // copied from 0117

0058      r22 = r20 * #8                 // initial value of r22

0040  G0000:

0060      cmp r1, r2                     // i < n

0070      bge G0001

0080

0118      r10  = r40                     // strength reduced expression to r40

0090      // for (j = 0; j < n; j++)

0100        {

0120  G0002:

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0320      r14 = r8  + r1                 // calculate subscript i * n + i

0330      r15 = r14 * #8                 // calculate byte address

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1  = r1 + #1

0385      r8  = r8 + r2                  // strength reduce r8 = r1 * r2

0386      r40 = r40 + r30                // strength reduce expression r8 * 8

0388      r22 = r22 + r30                // strength reduce r22 = r20 * 8

0390      br G0000

0400    }

0410  G0001:

At line 0320, r14 is the sum of r8 and r1, and r8 and r1 are being incremented in the loop. Register r8 is being bumped by r2 (=n) and r1 is being bumped by 1. Consequently, r14 is being bumped by n+1 each time through the loop. The last loop multiply at 0330 can be strength reduced by adding (r2+1)*8 each time through the loop.



0010  // for (i = 0, i < n; i++)

0020    {

0030      r1 = #0                        // i = 0

0050      load r2, n

0220      fr3 = #0.0

0340      fr4 = #1.0

0055      r8  = r1 * r2                  // set initial value for r8

0056      r40 = r8 * #8                  // initial value for r8 * 8

0057      r30 = r2 * #8                  // increment for r40

0058      r20 = r8 + r2                  // copied from 0117

0058      r22 = r20 * #8                 // initial value of r22

005A      r14 = r8 + #1                  // copied from 0320

005B      r15 = r14 * #8                 // initial value of r15 (0330)

005C      r49 = r2 + 1

005D      r50 = r49 * #8                 // strength reduced increment

0040  G0000:

0060      cmp r1, r2                     // i < n

0070      bge G0001

0080

0118      r10  = r40                     // strength reduced expression to r40

0090      // for (j = 0; j < n; j++)

0100        {

0120  G0002:

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0320  //  r14 = r8  + r1     killed      // dead code

0330  //  r15 = r14 * #8     killed      // strength reduced

0350      fstore fr4, A[r15]

0360

0370      //i++

0380      r1  = r1 + #1

0385      r8  = r8 + r2                  // strength reduce r8 = r1 * r2

0386      r40 = r40 + r30                // strength reduce expression r8 * 8

0388      r22 = r22 + r30                // strength reduce r22 = r20 * 8

0389      r15 = r15 + r50                // strength reduce r15 = r14 * 8

0390      br G0000

0400    }

0410  G0001:

There's still more to go. Constant folding will recognize that r1=0 in the preamble, so several instruction will clean up. Register r8 isn't used in the loop, so it can disappear.

Furthermore, r1 is only being used to control the loop, so r1 can be replaced by a different induction variable such as r40. Where i went 0 <= i < n, register r40 goes 0 <= r40 < 8 * n * n.



0010  // for (i = 0, i < n; i++)

0020    {

0030  //  r1 = #0                        // i = 0, becomes dead code

0050      load r2, n

0220      fr3 = #0.0

0340      fr4 = #1.0

0055  //  r8  = #0         killed        // r8 no longer used

0056      r40 = #0                       // initial value for r8 * 8

0057      r30 = r2 * #8                  // increment for r40

0058  //  r20 = r2         killed        // r8 = 0, becomes dead code

0058      r22 = r2  * #8                 // r20 = r2

005A  //  r14 =       #1   killed        // r8 = 0, becomes dead code

005B      r15 =       #8                 // r14 = 1

005C      r49 = r2 + 1

005D      r50 = r49 * #8                 // strength reduced increment

005D      r60 = r2 * r30                 // new limit for r40

0040  G0000:

0060  //  cmp r1, r2       killed        // i < n; induction variable replaced

0065      cmp r40, r60                   // i * 8 * n < 8 * n * n

0070      bge G0001

0080

0118      r10  = r40                     // strength reduced expression to r40

0090      // for (j = 0; j < n; j++)

0100        {

0120  G0002:

0147           cmp r10, r22              // r10 = 8*(r8 + j) < 8*(r8 + n) = r22

0150           bge G0003

0160

0170           // A[i,j] = 0.0;

0230           fstore fr3, A[r10]

0240

0245           r10 = r10 + #8            // strength reduced multiply

0260           br G0002

0270        }

0280  G0003:

0290      // A[i,i] = 1.0;

0350      fstore fr4, A[r15]

0360

0370      //i++

0380  //  r1  = r1 + #1      killed      // dead code (r40 controls loop)

0385  //  r8  = r8 + r2      killed      // dead code

0386      r40 = r40 + r30                // strength reduce expression r8 * 8

0388      r22 = r22 + r30                // strength reduce r22 = r20 * 8

0389      r15 = r15 + r50                // strength reduce r15 = r14 * 8

0390      br G0000

0400    }

0410  G0001:

Other strength reduction operations

This material is disputed. It is better described as peephole optimizations and instruction assignment.

Operator strength reduction uses mathematical identities to replace slow math operations with faster operations. The benefits depend on the target CPU and sometimes on the surrounding code (which can affect the availability of other functional units within the CPU).

replacing integer division or multiplication by a power of 2 with an arithmetic shift
Arithmetic shift
In computer programming, an arithmetic shift is a shift operator, sometimes known as a signed shift . For binary numbers it is a bitwise operation that shifts all of the bits of its operand; every bit in the operand is simply moved a given number of bit positions, and the vacant bit-positions are...

or logical shift
Logical shift
In computer science, a logical shift is a bitwise operation that shifts all the bits of its operand. Unlike an arithmetic shift, a logical shift does not preserve a number's sign bit or distinguish a number's exponent from its mantissa; every bit in the operand is simply moved a given number of bit...
replacing integer multiplication by a constant with a combination of shifts, adds or subtracts

original calculation	replacement calculation
y = x / 8	y = x >> 3
y = x * 64	y = x << 6
y = x * 2	y = x + x
y = x * 15	y = (x << 4) - x

Induction variable (orphan)

Induction variable

In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop, or is a linear function of another induction variable....

or recursive strength reduction replaces a function of some systematically changing variable with a simpler calculation using previous values of the function. In a procedural programming language this would apply to an expression involving a loop variable and in a declarative language it would apply to the argument of a recursive function

Recursion (computer science)

Recursion in computer science is a method where the solution to a problem depends on solutions to smaller instances of the same problem. The approach can be applied to many types of problems, and is one of the central ideas of computer science....

. For example,

f x = ... (2 ** x) ... (f (x + 1)) ...

becomes

f x = f' x 1
where
f' x z = ... z ... (f' (x + 1) (2 * z)) ...

Here the expensive operation (2 ** x) has been replaced by the cheaper (2 * z) in the recursive function f'. This maintains the invariant that z = 2 ** x for any call to f'.