Duplicate code
Duplicate code is a computer programming term for a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a number of reasons. A minimum requirement is usually applied to the quantity of code that must appear in a sequence for it to be considered duplicate rather than coincidentally similar. Sequences of duplicate code are sometimes known as clones.

The following are some of the ways in which two code sequences can be duplicates of each other:
  • character-for-character identical
  • character-for-character identical with white space characters and comments being ignored
  • token-for-token identical
  • token-for-token identical with occasional variation (i.e., insertion/deletion/modification of tokens)
  • functionally identical
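
As a rough illustration of the second kind of match in the list above, the following C sketch compares two fragments character for character while skipping white space. It is a simplified assumption of how such a comparison might look, not the method of any particular tool: the names skip_space and equal_ignoring_whitespace are hypothetical, and comments are not stripped.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: advance past any white space characters. */
static const char *skip_space(const char *s)
{
    while (*s != '\0' && isspace((unsigned char)*s)) {
        s++;
    }
    return s;
}

/* Returns true if the two fragments are identical once all white space
   is ignored. Comments are NOT removed in this sketch. */
bool equal_ignoring_whitespace(const char *a, const char *b)
{
    for (;;) {
        a = skip_space(a);
        b = skip_space(b);
        if (*a != *b) {
            return false;   /* first differing non-space character */
        }
        if (*a == '\0') {
            return true;    /* both fragments ended together */
        }
        a++;
        b++;
    }
}
```

Because every white space character is dropped, this sketch would also treat "int x" and "intx" as equal, which is one reason a detector might compare tokens rather than raw characters.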

How duplicates are created

There are a number of reasons why duplicate code may be created, including:
  • Copy and paste programming, in which a section of code is copied "because it works". In most cases this operation involves slight modifications in the cloned code such as renaming variables or inserting/deleting code.
  • Functionality that is very similar to that in another part of a program is required and a developer independently writes code that is very similar to what exists elsewhere.
  • Plagiarism, where code is simply copied without permission or attribution.

Problems associated with duplicate code

Code duplication is generally considered a mark of poor or lazy programming style. Good coding style is generally associated with code reuse. It may be slightly faster to develop by duplicating code, because the developer need not consider how the code is already used or how it may be used in the future. The difficulty is that original development is only a small fraction of a product's life cycle, and with code duplication the maintenance costs are much higher. Some of the specific problems include:
  • Code bulk affects comprehension: Code duplication frequently creates long, repeated sections of code that differ in only a few lines or characters. The length of such routines can make it difficult to quickly understand them. This is in contrast to the "best practice" of code decomposition.
  • Purpose masking: The repetition of largely identical code sections can conceal how they differ from one another, and therefore, what the specific purpose of each code section is. Often, the only difference is in a parameter value. The best practice in such cases is a reusable subroutine.
  • Update anomalies: Duplicate code contradicts a fundamental principle of database theory that applies here: Avoid redundancy. Non-observance incurs update anomalies, which increase maintenance costs, in that any modification to a redundant piece of code must be made for each duplicate separately. At best, coding and testing time are multiplied by the number of duplications. At worst, some locations may be missed, and for example bugs thought to be fixed may persist in duplicated locations for months or years. The best practice here is a code library.
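
The purpose-masking and update-anomaly problems above can be sketched in C. All function names here are hypothetical and chosen only for illustration. The first two routines differ in a single constant that is easy to overlook, and a fix made to one would have to be repeated in the other. The third routine turns the difference into an explicit parameter.

```c
#include <stddef.h>

/* Before: two near-identical routines (hypothetical names). The only real
   difference is the threshold buried in the loop body. */
size_t count_above_10(const int *values, size_t count)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] > 10) {
            hits++;
        }
    }
    return hits;
}

size_t count_above_100(const int *values, size_t count)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] > 100) {
            hits++;
        }
    }
    return hits;
}

/* After: one reusable subroutine. The difference between the two uses is
   now a visible argument, and any fix is made in exactly one place. */
size_t count_above(const int *values, size_t count, int threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] > threshold) {
            hits++;
        }
    }
    return hits;
}
```

A call such as count_above(values, n, 10) then states its purpose at the call site instead of hiding it inside a copied loop.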

Detecting duplicate code

A number of different algorithms have been proposed to detect duplicate code. For example:
  • Baker's algorithm.
  • Rabin–Karp string search algorithm.
  • Using abstract syntax trees.
  • Visual clone detection.
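
To give a feel for the string-search approach listed above, here is a minimal C sketch of Rabin–Karp matching that counts how often a known fragment occurs in a body of text. A count above one suggests a character-for-character duplicate. This is an illustration under stated assumptions, not how any specific detection tool works: the function name rk_count, the base and the modulus are all hypothetical choices, and a practical detector would more likely hash normalized tokens or lines than raw characters.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RK_BASE 257u         /* assumed hash base */
#define RK_MOD  1000000007u  /* assumed prime modulus */

/* Counts (possibly overlapping) occurrences of pattern in text using a
   rolling hash; every hash hit is verified with memcmp. */
size_t rk_count(const char *text, const char *pattern)
{
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    if (m == 0 || m > n) {
        return 0;
    }

    /* high = RK_BASE^(m-1) mod RK_MOD, the weight of the leading character */
    uint64_t high = 1;
    for (size_t i = 1; i < m; i++) {
        high = (high * RK_BASE) % RK_MOD;
    }

    /* Hash the pattern and the first window of the text. */
    uint64_t hp = 0, ht = 0;
    for (size_t i = 0; i < m; i++) {
        hp = (hp * RK_BASE + (unsigned char)pattern[i]) % RK_MOD;
        ht = (ht * RK_BASE + (unsigned char)text[i]) % RK_MOD;
    }

    size_t count = 0;
    for (size_t i = 0; ; i++) {
        if (hp == ht && memcmp(text + i, pattern, m) == 0) {
            count++;
        }
        if (i + m >= n) {
            break;  /* no further window fits */
        }
        /* Slide the window: drop text[i], append text[i + m]. */
        uint64_t lead = ((unsigned char)text[i] * high) % RK_MOD;
        ht = (ht + RK_MOD - lead) % RK_MOD;
        ht = (ht * RK_BASE + (unsigned char)text[i + m]) % RK_MOD;
    }
    return count;
}
```

The rolling hash lets each window be checked in constant time on average, with the memcmp call guarding against hash collisions.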

Example of functionally duplicate code

Consider the following code snippet for calculating the average of an array of integers:


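extern int array1[];
extern int array2[];

int sum1 = 0;
int sum2 = 0;
int average1 = 0;
int average2 = 0;

/* Sum and average the four elements of array1. */
for (int i = 0; i < 4; i++)
{
    sum1 += array1[i];
}
average1 = sum1/4;  /* integer division: the result is truncated */

/* The same logic, repeated for array2. */
for (int i = 0; i < 4; i++)
{
    sum2 += array2[i];
}
average2 = sum2/4;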


The two loops can be rewritten as the single function:

/* Returns the (truncated) integer average of a 4-element array. */
int calcAverage (int* Array_of_4)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
    {
        sum += Array_of_4[i];
    }
    return sum/4;
}

Using the above function will give source code that has no loop duplication:

extern int array1[];
extern int array2[];

int average1 = calcAverage(array1);
int average2 = calcAverage(array2);
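
The calcAverage function above is still tied to arrays of exactly four elements, so a near-copy would be needed for any other length. One possible generalization, sketched here with the hypothetical name calc_average and an added length parameter, removes that restriction. Returning 0 for an empty array is an assumed convention, not something the example above specifies.

```c
#include <stddef.h>

/* Hypothetical generalization of calcAverage: the caller passes the
   element count instead of relying on a fixed length of 4. */
int calc_average(const int *values, size_t count)
{
    if (count == 0) {
        return 0;  /* assumed convention for an empty array */
    }

    long long sum = 0;  /* wider type to reduce the risk of overflow */
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    /* Integer division truncates toward zero, as in the original. */
    return (int)(sum / (long long)count);
}
```

With this version, the two calls above would become calc_average(array1, 4) and calc_average(array2, 4).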


Tools

Code duplication analysis tools include:
  • Atomiq - commercial
  • Black Duck Suite - commercial (software analyzing suite)
  • CCFinder (C/C++, Java, COBOL, Fortran, etc. / difficult to compile on non-Windows operating systems)
  • Checkstyle (Java)
  • CloneAnalyzer (C/C++ and Java / Eclipse plugin only)
  • Clone Digger (Python and Java)
  • CloneDR - commercial (Ada, C, C++, C#, Java, COBOL, Fortran, Python, VB.net, VB6, PHP4/5, PLSQL, SQL2011, XML, many others)
  • Copy/Paste Detector (CPD) from PMD (Java, JSP, C, C++, Fortran, PHP)
  • ConQAT (Open Source, supports: ABAP, ADA, Cobol, C/C++, C#, Java, PL/I, PL/SQL, Python, Text, Transact SQL, Visual Basic, XML)
  • JPlag (Java, C#, C, C++, Scheme and natural language text)
  • Pattern Miner (CP Miner) - commercial
  • Simian
  • Google CodePro Analytix (Java / Eclipse plugin only)

See also

  • Abstraction principle (programming)
  • Code smell
  • Don't repeat yourself
  • List of tools for static code analysis
  • Rule of three (programming)

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 