Byte pair encoding - AbsoluteAstronomy.com

Byte pair encoding or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal.

Byte pair encoding example

Suppose we wanted to encode the data

aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have the following data and replacement table:

ZabdZabac
Z=aa
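
This first step can be sketched in Python as follows. It is only an illustration under some assumptions: the data is held as a text string rather than raw bytes, overlapping occurrences are counted (so "aaa" contributes two "aa" pairs), and the helper name most_common_pair is hypothetical.

```python
from collections import Counter

def most_common_pair(data: str) -> tuple[str, int]:
    """Return the most frequent pair of adjacent characters and its count.

    Assumes data has at least two characters. Overlapping occurrences are
    counted, which is one possible reading of "occurs most often".
    """
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return pairs.most_common(1)[0]

data = "aaabdaaabac"
pair, count = most_common_pair(data)

# "Z" does not occur in the data, so it can stand in for the pair.
# str.replace substitutes non-overlapping occurrences from left to right.
step1 = data.replace(pair, "Z")
table = [("Z", pair)]
```

Here pair is "aa" and step1 is "ZabdZabac", matching the data and table shown above.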

Then we repeat the process with byte pair "ab", replacing it with Y:

ZYdZYac
Y=ab
Z=aa

We could stop here, as the only literal byte pair left occurs only once.
Or we could continue the process and use recursive byte pair encoding, replacing "ZY" with "X":

XdXac
X=ZY
Y=ab
Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
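
The whole procedure, repeated until no pair occurs more than once, might look like the sketch below. Several things here are assumptions rather than part of the description above: the function name compress, the use of text strings instead of bytes, the caller supplying the unused symbols, and the tie-breaking rule (the first pair encountered wins when counts are equal).

```python
from collections import Counter

def compress(data: str, unused: str) -> tuple[str, list[tuple[str, str]]]:
    """Repeatedly replace the most common adjacent pair with an unused symbol.

    `unused` lists symbols that must not occur in `data` (an assumption the
    caller has to guarantee). Returns the compressed string and the
    replacement table in the order the replacements were made.
    """
    table = []
    symbols = iter(unused)
    while True:
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break  # fewer than two characters left
        # Ties are broken in favour of the pair seen first.
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once
        symbol = next(symbols, None)
        if symbol is None:
            break  # ran out of unused symbols
        data = data.replace(pair, symbol)
        table.append((symbol, pair))
    return data, table

compressed, table = compress("aaabdaaabac", unused="ZYX")
```

For this input the sketch also ends at "XdXac". Because "Za" and "ab" both occur twice after the first step and this sketch picks "Za", its table is Z=aa, Y=Za, X=Yb rather than the one in the worked example; either table rebuilds the original data.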

To decompress the data, simply perform the replacements in the reverse order.
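
As a minimal sketch of that reversal, assuming the table is stored as a list of (symbol, pair) entries in the order the replacements were made (the function name decompress and this table layout are illustrative choices):

```python
def decompress(data: str, table: list[tuple[str, str]]) -> str:
    """Undo the replacements, starting with the most recent one."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data

# Table from the worked example, in the order it was built.
table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
original = decompress("XdXac", table)
```

The intermediate strings are "ZYdZYac", then "ZabdZabac", and finally "aaabdaaabac". The order matters: "X" has to be expanded first because its expansion contains "Z" and "Y", which are expanded in the later passes.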

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.