Delta encoding
Encyclopedia
Delta encoding is a way of storing or transmitting data
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...

 in the form of difference
Difference
Difference may refer to:* Difference , a 2005 power metal album* Difference , a concept in computer science* Difference , any systematic way of distinguishing similar coats of arms belonging to members of the same family* Difference , a statement about the relative size or order of two objects**...

s
between sequential data rather than complete files; more generally this is known as data differencing
Data differencing
In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target...

. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in software project
Project
A project in business and science is typically defined as a collaborative enterprise, frequently involving research or design, that is carefully planned to achieve a particular aim. Projects can be further defined as temporary rather than permanent social systems that are constituted by teams...

s).

The differences are recorded in discrete files called "deltas" or "diffs", after the Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...

 file comparison utility, diff
Diff
In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also...

. Because changes are often small – for example, changing a few words in a large document, or changing a few records in a large table – delta encoding greatly reduces data redundancy. Collections of unique deltas are substantially more space-efficient than their non-encoded equivalents.

From a logical point of view the difference between two data values is the information required to obtain one value from the other – see relative entropy. The difference between identical values (under some equivalence
Equivalence relation
In mathematics, an equivalence relation is a relation that, loosely speaking, partitions a set so that every element of the set is a member of one and only one cell of the partition. Two elements of the set are considered equivalent if and only if they are elements of the same cell...

) is often called 0 or the neutral element. A good delta should be minimal, or ambiguous unless one element of a pair is present.

Simple example

Perhaps the simplest example is storing values of bytes as differences (deltas) between sequential values, rather than the values themselves. So, instead of 2, 4, 6, 9, 7, we would store 2, 2, 2, 3, −2. This is not very useful when used alone, but it can help further compression of data in which sequential values occur often. IFF
Interchange File Format
Interchange File Format , is a generic container file format originally introduced by the Electronic Arts company in 1985 in order to ease transfer of data between software produced by different companies....

 8SVX
8SVX
8SVX is a subformat of the Interchange File Format. The subformat is for 8-bit sampled sounds, supports both mono and stereo streams as well as loops; commonly used as a basic audio sample format on Amiga computers for many years...

 sound format applies this encoding to raw sound data before applying compression to it. Unfortunately, not even all 8-bit sound samples
Sampling (signal processing)
In signal processing, sampling is the reduction of a continuous signal to a discrete signal. A common example is the conversion of a sound wave to a sequence of samples ....

 compress better when delta encoded, and the usability of delta encoding is even smaller for 16-bit and better samples. Therefore, compression algorithms often choose to delta encode only when the compression is better than without. However, in video compression frames can considerably reduce frame size, and are used in virtually every video compression codec
Codec
A codec is a device or computer program capable of encoding or decoding a digital data stream or signal. The word codec is a portmanteau of "compressor-decompressor" or, more commonly, "coder-decoder"...

.

Definition

A delta can be defined in 2 ways, symmetric delta and directed delta. A symmetric delta can be expressed as:
Δ(v1, v2) = (v1 \ v2) U (v2 \ v1),

where v1 and v2 represent two successive versions.

A directed delta, also called a change, is a sequence of (elementary) change operations which, when applied to one version v1, yields another version v2 (note the correspondence to transaction log
Transaction log
In the field of databases in computer science, a transaction log is a history of actions executed by a database management system to guarantee ACID properties over crashes or hardware failures...

s in databases).

Variants

A variation of delta encoding which encodes differences between the prefixes or suffixes of strings
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

 is called incremental encoding
Incremental encoding
Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated...

. It is particularly effective for sorted lists with small differences between strings, such as a list of word
Word
In language, a word is the smallest free form that may be uttered in isolation with semantic or pragmatic content . This contrasts with a morpheme, which is the smallest unit of meaning but will not necessarily stand on its own...

s from a dictionary
Dictionary
A dictionary is a collection of words in one or more specific languages, often listed alphabetically, with usage information, definitions, etymologies, phonetics, pronunciations, and other information; or a book of words in one language with their equivalents in another, also known as a lexicon...

.

Implementation issues

The nature of the data to be encoded influences the effectiveness of a particular compression algorithm. For example, in sending updates to the Google Chrome
Google Chrome
Google Chrome is a web browser developed by Google that uses the WebKit layout engine. It was first released as a beta version for Microsoft Windows on September 2, 2008, and the public stable release was on December 11, 2008. The name is derived from the graphical user interface frame, or...

 browser, a specialized algorithm is used based on knowledge of the archive and executable format.

Delta encoding performs best when data has small or constant variation; for an unsorted data set, there may be little to no compression possible with this method.

In delta encoded transmission over a network where only a single copy of the file is available at each end of the communication channel, special error control codes are used to detect which parts of the file have changed since its previous version.
For example, rsync
Rsync
rsync is a software application and network protocol for Unix-like and Windows systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate. An important feature of rsync not found in most similar...

 uses a rolling checksum algorithm based on Mark Adler's adler-32
Adler-32
Adler-32 is a checksum algorithm which was invented by Mark Adler in 1995, and is a modification of the Fletcher checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed. Adler-32 is more reliable than Fletcher-16, and slightly less reliable than Fletcher-32...

 checksum.

Sample C code

The following C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 code performs a simple form of delta encoding and decoding:


void
delta_encode (char *buffer, const unsigned int length)
{
char delta = 0;
char original;
unsigned int i;
for (i = 0; i < length; ++i)
{
original = buffer[i];
buffer[i] -= delta;
delta = original;
}
}

void
delta_decode (char *buffer, const unsigned int length)
{
char delta = 0;
unsigned int i;
for (i = 0; i < length; ++i)
{
buffer[i] += delta;
delta = buffer[i];
}
}

Delta encoding in HTTP

Another instance of use of delta encoding is RFC 3229, "Delta encoding in HTTP", which proposes that HTTP servers should be able to send updated Web pages in the form of differences between versions (deltas), which should decrease Internet traffic, as most pages change slowly over time, rather than being completely rewritten repeatedly:

Online backup

Many of the online backup services adopt this methodology, often known simply as deltas, in order to give their users previous versions of the same file from previous backups. This reduces associated costs, not only in the amount of data that has to be stored as differing versions (as the whole of each changed version of a file has to be offered for users to access), but also those costs in the uploading (and sometimes the downloading) of each file that has been updated (by just the smaller delta having to be used, rather than the whole file).

VCDIFF

One general format for delta encoding is VCDIFF, described in RFC 3284. Free software
Free software
Free software, software libre or libre software is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with restrictions that only ensure that further recipients can also do...

 implementations include Xdelta
Xdelta
xdelta is a command line program for delta encoding, which generates two file difference. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output....

 and open-vcdiff.

GDIFF

Generic Diff Format (GDIFF) is another delta encoding. It is submitted to W3C in 1997. In many cases, VCDIFF has better compression rate than GDIFF.

bsdiff

Bsdiff is a binary diff program using suffix sorting.

See also

  • String-to-string correction problem
    String-to-string correction problem
    The string-to-string correction problem refers to the minimum number of edit operations necessary to change one string into another. A single edit operation may be changing a single symbol of the string into another, deleting, or inserting a symbol...

  • Data differencing
    Data differencing
    In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target...

  • Source Code Control System
    Source Code Control System
    Source Code Control System is an early revision control system, geared toward program source code and other text files. It was originally developed in SNOBOL at Bell Labs in 1972 by Marc J. Rochkind for an IBM System/370 computer running OS/360 MVT...


External links

  • RFC 3229 – Delta Encoding in HTTP
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK