File comparison
File comparison in computing compares the contents of computer files, finding their common contents and their differences. The result of the comparison may be presented in a graphical user interface or used as part of larger tasks in networks, file systems, or revision control.

Some widely used file comparison programs are diff, cmp, FileMerge, Araxis Merge, WinMerge, Beyond Compare, and Microsoft File Compare.

Many text editors and word processors perform file comparison to highlight the changes to a document.

Method types

Most file comparison tools find the longest common subsequence between two files. Any data not in the longest common subsequence is presented as an insertion or deletion. Very few file comparison programs find block moves.
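
The longest-common-subsequence approach described above can be illustrated with a minimal sketch. It assumes each file has already been read into a list of lines; the function names `lcs_table` and `diff_lines` are hypothetical and are not taken from any particular tool. Production tools generally use faster algorithms than this quadratic table.

```python
def lcs_table(a, b):
    """Build a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_lines(a, b):
    """Return (tag, line) pairs: ' ' for common, '-' for deleted, '+' for inserted."""
    table = lcs_table(a, b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The line belongs to the longest common subsequence.
            result.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] keeps the LCS at least as long: report a deletion.
            result.append(("-", a[i]))
            i += 1
        else:
            result.append(("+", b[j]))
            j += 1
    # Whatever remains in either file lies outside the LCS.
    result.extend(("-", line) for line in a[i:])
    result.extend(("+", line) for line in b[j:])
    return result
```

With this sketch, swapping two lines is reported as one deletion plus one insertion, which shows why a plain longest-common-subsequence comparison does not detect block moves.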

In 1978, Paul Heckel published an algorithm that identifies most moved blocks of text. This is used in the IBM History Flow tool.

Some specialized file comparison tools find the longest increasing subsequence between two files. The rsync protocol uses a rolling hash function to compare two files on two distant computers with low communication overhead.
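
The idea of a rolling hash can be sketched as follows. This is a simple sum-based checksum chosen for illustration, similar in spirit to the weak checksum associated with rsync but not a reproduction of rsync's implementation. The modulus `MOD` and the function names are assumptions made for this example.

```python
MOD = 1 << 16  # assumed modulus for this sketch


def weak_checksum(block):
    """Compute the checksum pair (a, b) of a block of bytes from scratch."""
    n = len(block)
    a = sum(block) % MOD
    # Each byte is weighted by its distance from the end of the window.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, window):
    """Slide the window one byte: drop out_byte at the front, add in_byte at the back."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - window * out_byte + a) % MOD
    return a, b


def rolling_checksums(data, window):
    """Yield (offset, a, b) for every window-sized slice of data."""
    if len(data) < window:
        return
    a, b = weak_checksum(data[:window])
    yield 0, a, b
    for start in range(1, len(data) - window + 1):
        a, b = roll(a, b, data[start - 1], data[start + window - 1], window)
        yield start, a, b
```

Because each update needs only the byte leaving the window and the byte entering it, a checksum can be obtained for every offset of a file in a single pass. A receiver can then match blocks by checksum instead of exchanging the full contents.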

File comparison in word processors is typically at the word level, while comparison in most programming tools is at the line level. Byte or character-level comparison is useful in some specialized applications.
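
The effect of comparison granularity can be illustrated with Python's standard `difflib` module. The helper name `changed_units` is hypothetical. The sketch assumes the caller supplies a function that splits text into the units to compare, such as lines or words.

```python
import difflib


def changed_units(old, new, split):
    """Return (tag, old_units, new_units) for each non-matching region.

    `split` decides the granularity, for example str.splitlines for
    line-level comparison or str.split for word-level comparison.
    """
    a, b = split(old), split(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes
```

Given two texts that differ in one word, line-level comparison reports the whole line as replaced, while word-level comparison reports only the changed word.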

Reasoning

Comparison tools are used for various reasons. For binary files, byte-level comparison is usually the most appropriate. For text files, including source code and scripts written in human-readable languages, a side-by-side visual comparison is usually best. It lets the user decide whether to retain one file as the preferred version, merge the files into one containing all of the differences, or keep both as they are for later reference through some form of version control. Versioning is also important for backup purposes.

File comparison is an important, and most likely integral, part of file synchronization and backup. Even in backup methodologies, corruption is an important issue. Corruption typically occurs without warning and goes unnoticed until it is too late to recover the missing parts. Often a corrupted file is discovered only when it is next used or opened. Short of that, a comparison tool is needed to at least recognize that a difference has occurred. File synchronization and backup programs therefore need file comparison if they are to be useful and trusted.
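
One way to recognize that a file and its backup copy have diverged is to compare cryptographic digests of the two. The following is a minimal sketch using Python's standard `hashlib` module. The function names, the choice of SHA-256, and the chunk size are assumptions made for illustration, not a description of how any particular backup program works.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a large file into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """Report whether two files have the same digest."""
    return file_digest(path_a) == file_digest(path_b)
```

A digest mismatch shows only that the files differ, not where. Locating the difference still requires a byte-level or line-level comparison. A stored digest does, however, allow a file to be checked later without keeping a second full copy at hand.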

When used in automated processes, file comparison can be configured to choose how a changed file is saved. A sensible default is usually to create another version of the same file automatically, so that the user does not have to monitor the process as it runs. Unneeded versions can then be reviewed and eliminated later, at a more convenient time.

Historical uses

Before software file comparison, machines existed to compare magnetic tapes or punched cards. The IBM 519 Card Reproducer could determine whether two decks of punched cards were equivalent. In 1957, John Van Gardner developed a system to compare the checksums of loaded sections of Fortran programs to debug compilation problems on the IBM 704 (http://www.softwarepreservation.org/projects/FORTRAN/paper/John%20Van%20Gardner%20-%20Fortran%20And%20The%20Genesis%20Of%20Project%20Intercept.pdf).

See also

  • Comparison of file comparison tools
  • Computer-assisted reviewing
  • Data differencing
  • Delta encoding
  • Edit distance
  • File synchronization

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 