Duff's device
Encyclopedia
In computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, Duff's device is an optimized
Optimization (computer science)
In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...

 implementation
Implementation
Implementation is the realization of an application, or execution of a plan, idea, model, design, specification, standard, algorithm, or policy.-Computer Science:...

 of a serial copy that uses a technique widely applied in assembly language
Assembly language
An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

 for loop unwinding
Loop unwinding
Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size...

. Its discovery is credited to Tom Duff
Tom Duff
Thomas Douglas Selkirk Duff is a computer programmer. He was born in Toronto, Ontario, Canada and grew up in Toronto and Leaside. In 1974 he graduated from the University of Waterloo with a B.Math and, two years later, got an M.Sc...

 in November of 1983, who at the time was working for Lucasfilm
Lucasfilm
Lucasfilm Limited is an American film production company founded by George Lucas in 1971, based in San Francisco, California. Lucas is the company's current chairman and CEO, and Micheline Chau is the president and COO....

. It is perhaps the most dramatic use of case label fall-through
Switch statement
In computer programming, a switch, case, select or inspect statement is a type of selection control mechanism that exists in most imperative programming languages such as Pascal, Ada, C/C++, C#, Java, and so on. It is also included in several other types of languages...

 in the C programming language
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 to date. Duff does not claim credit for discovering the concept of loop unrolling
Loop unwinding
Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size...

, just this particular expression of it in C.

Background

Loop-unrolling revolves around lowering the number of branches made, by batching them together. To handle cases where the number of iterations is not divisible by the unrolled-loop increments, a common technique is to jump directly into the middle of the unrolled loop for copying the remainder.

Duff was looking for a similar optimization for his case, and succeeded in doing so in C, unrolling a loop into a loop which assigns (up to) 8 values on each iteration.

Original version

Traditionally, a serial copy would look like:


do { /* count > 0 assumed */
*to = *from++; /* Note that the 'to' pointer is NOT incremented */
} while(--count > 0);


Note that to is not incremented because Duff was copying to a single memory-mapped output
Memory-mapped I/O
Memory-mapped I/O and port I/O are two complementary methods of performing input/output between the CPU and peripheral devices in a computer...

 register.

While optimizing this, Duff realized that an unrolled version of his loop could be implemented by interlacing the structures of a switch and a loop.


send(to, from, count)
register short *to, *from;
register count;
{
register n = (count + 7) / 8;
switch(count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while(--n > 0);
}
}


Notice that device can just as easily be applied with any other size for the unrolled loop, not just 8.

Reason it works

Based on an algorithm used widely by programmers coding in assembly
Assembly language
An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

 for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid, legal C by virtue of two attributes in C:
  1. Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall-through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
  2. The ability to legally jump into the middle of a loop in C.


Note that, as documented in the comment appearing in Duff's un-optimized version, the code assumes that count is strictly positive.

Performance

Many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall-through in case statements has long been one of its most controversial features; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."

The primary increase in speed versus a simple, straightforward loop comes from loop unwinding
Loop unwinding
Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size...

, which reduces the number of branches performed (which are computationally expensive due to the need to flush - and hence stall - the pipeline). The switch/case statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, 8 byte moves are unrolled, so the switch/case handles an extra 1–7 bytes automatically).

This automatic handling of the remainder may not be the best solution on all systems and compilers — in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures. When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance. Therefore, when considering using this code, it may be worth running a few benchmarks
Benchmark (computing)
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it...

 to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.

Stroustrup's version

The original Device was made for copying to a (memory-mapped) register. To actually copy memory from one location to another, an auto-increment must be added to every reference to to, like so:

*to++ = *from++;

This modified form of the Device appears as a "what does this code do?" exercise in Bjarne Stroustrup
Bjarne Stroustrup
Bjarne Stroustrup ; born December 30, 1950 in Århus, Denmark) is a Danish computer scientist, most notable for the creation and the development of the widely used C++ programming language...

's book The C++ Programming Language, presumably because novice programmers cannot be expected to know about memory-mapped output registers. However, the standard C library provides the function memcpy for this purpose; it will not perform worse than this code, and may contain architecture specific optimizations that will make it significantly faster.

Books

  • Stroustrup, Bjarne
    Bjarne Stroustrup
    Bjarne Stroustrup ; born December 30, 1950 in Århus, Denmark) is a Danish computer scientist, most notable for the creation and the development of the widely used C++ programming language...

    , The C++ Programming Language, Third Edition. Addison-Wesley, ISBN 0-201-88954-4
  • Kernighan, Brian
    Brian Kernighan
    Brian Wilson Kernighan is a Canadian computer scientist who worked at Bell Labs alongside Unix creators Ken Thompson and Dennis Ritchie and contributed to the development of Unix. He is also coauthor of the AWK and AMPL programming languages. The 'K' of K&R C and the 'K' in AWK both stand for...

     and Dennis Ritchie
    Dennis Ritchie
    Dennis MacAlistair Ritchie , was an American computer scientist who "helped shape the digital era." He created the C programming language and, with long-time colleague Ken Thompson, the UNIX operating system...

    , The C Programming Language
    The C Programming Language (book)
    The C Programming Language is a well-known programming book written by Brian Kernighan and Dennis Ritchie, the latter of whom originally designed and implemented the language, as well as co-designed the Unix operating system with which development of the language was closely intertwined...

    .

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK