Context mixing

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions. For example, one simple method (not necessarily the best) is to average the probabilities assigned by each model. The random forest is another method: it outputs the prediction that is the mode of the predictions output by individual models. Combining models is an active area of research in machine learning.

The PAQ series of data compression programs uses context mixing to assign probabilities to individual bits of the input.

Application to Data Compression

Suppose that we are given two conditional probabilities, P(X|A) and P(X|B), and we wish to estimate P(X|A,B), the probability of event X given both conditions A and B. There is insufficient information for probability theory to give a result. In fact, it is possible to construct scenarios in which the result could be anything at all. But intuitively, we would expect the result to be some kind of average of the two.

The problem is important for data compression. In this application, A and B are contexts, X is the event that the next bit or symbol of the data to be compressed has a particular value, and P(X|A) and P(X|B) are the probabilities estimated by two independent models. The compression ratio depends on how closely the estimated probability approaches the true but unknown probability of event X. It is often the case that contexts A and B have occurred often enough to estimate P(X|A) and P(X|B) accurately by counting occurrences of X in each context, but the two contexts either have not occurred together frequently, or there are insufficient computing resources (time and memory) to collect statistics for the combined case.

For example, suppose that we are compressing a text file. We wish to predict whether the next character will be a linefeed, given that the previous character was a period (context A) and that the last linefeed occurred 72 characters ago (context B). Suppose that a linefeed previously occurred after 1 of the last 5 periods (P(X|A) = 1/5 = 0.2) and in 5 out of the last 10 lines at column 72 (P(X|B) = 5/10 = 0.5). How should these predictions be combined?

Two general approaches have been used: linear and logistic mixing. Linear mixing uses a weighted average of the predictions, weighted by evidence. In this example, P(X|B) gets more weight than P(X|A) because P(X|B) is based on a greater number of tests. Older versions of PAQ use this approach. Newer versions use logistic (or neural network) mixing: the predictions are first transformed into the logistic domain, log(p/(1-p)), before averaging. This effectively gives greater weight to predictions near 0 or 1, in this case P(X|A). In both cases, additional weights may be given to each of the input models and adapted to favor the models that have given the most accurate predictions in the past. All but the oldest versions of PAQ use adaptive weighting.
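
To make the contrast concrete, here is a minimal Python sketch (illustrative code, not taken from PAQ) that applies both schemes to the example above, weighting the linear mix by the observation counts and giving the logistic mix equal weights:

  import math

  p_a, n_a = 0.2, 5    # P(X|A): linefeed followed 1 of the last 5 periods
  p_b, n_b = 0.5, 10   # P(X|B): linefeed ended 5 of the last 10 lines at column 72

  # Linear mixing: weight each prediction by its evidence (observation count).
  p_linear = (n_a * p_a + n_b * p_b) / (n_a + n_b)          # = 0.4

  # Logistic mixing: average in the stretched domain ln(p/(1-p)).
  def stretch(p): return math.log(p / (1 - p))
  def squash(x): return 1 / (1 + math.exp(-x))

  p_logistic = squash((stretch(p_a) + stretch(p_b)) / 2)    # ≈ 0.333

The logistic mix lands closer to P(X|A) = 0.2 than the linear mix does, reflecting the extra weight that stretching gives to predictions near 0 or 1.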

Most context mixing compressors predict one bit of input at a time. The output probability is simply the probability that the next bit will be a 1.

Linear Mixing

We are given a set of predictions P_i(1) = n1_i / n_i, where n_i = n0_i + n1_i, and n0_i and n1_i are the counts of 0 and 1 bits respectively for the i'th model. The probabilities are computed by weighted addition of the 0 and 1 counts:
  • S0 = Σ_i w_i n0_i
  • S1 = Σ_i w_i n1_i
  • S = S0 + S1
  • P(0) = S0 / S
  • P(1) = S1 / S
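
A minimal Python sketch of this computation (the names counts, weights, and mix_linear are illustrative, not PAQ identifiers):

  # counts holds one (n0_i, n1_i) pair per model; weights holds the w_i.
  def mix_linear(counts, weights):
      s0 = sum(w * n0 for (n0, n1), w in zip(counts, weights))  # S0
      s1 = sum(w * n1 for (n0, n1), w in zip(counts, weights))  # S1
      return s1 / (s0 + s1)   # P(1); P(0) is s0 / (s0 + s1)

  # Two models with counts (3, 1) and (1, 4) and equal weights 0.5
  # give P(1) = 2.5 / 4.5 ≈ 0.56.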


The weights w_i are initially equal and always sum to 1. Under the initial conditions, each model is weighted in proportion to evidence. The weights are then adjusted to favor the more accurate models. Suppose we are given that the actual bit being predicted is y (0 or 1). Then the weight adjustment is:
  • n_i = n0_i + n1_i
  • error = y - P(1)
  • w_i ← w_i + [(S n1_i - S1 n_i) / (S0 S1)] error
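
A sketch of the corresponding update step, continuing mix_linear above (the guard against a zero S0 or S1 is my addition to keep the sketch runnable; it assumes at least one count is nonzero):

  def update_linear(counts, weights, y, eps=1e-6):
      s0 = sum(w * n0 for (n0, n1), w in zip(counts, weights))
      s1 = sum(w * n1 for (n0, n1), w in zip(counts, weights))
      s = s0 + s1
      error = y - s1 / s                    # error = y - P(1)
      for i, (n0, n1) in enumerate(counts):
          n = n0 + n1                       # n_i
          weights[i] += (s * n1 - s1 * n) / max(s0 * s1, eps) * error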


Compression can be improved by bounding n_i so that the model weighting is better balanced. In PAQ6, whenever one of the bit counts is incremented, the part of the other count that exceeds 2 is halved. For example, after the sequence 000000001, the counts would go from (n0, n1) = (8, 0) to (5, 1).
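
The halving rule is short in code; this sketch rounds the halved excess down, which reproduces the example above (PAQ6's exact rounding may differ):

  # When one count is incremented, halve the other count's excess over 2.
  def update_counts(n0, n1, y):
      if y == 1:
          n1 += 1
          if n0 > 2:
              n0 = 2 + (n0 - 2) // 2
      else:
          n0 += 1
          if n1 > 2:
              n1 = 2 + (n1 - 2) // 2
      return n0, n1

  # After eight 0s, (n0, n1) = (8, 0); the next 1 gives (5, 1).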

Logistic Mixing

Let P_i(1) be the prediction by the i'th model that the next bit will be a 1. Then the final prediction P(1) is calculated:
  • x_i = stretch(P_i(1))
  • P(1) = squash(Σ_i w_i x_i)


where P(1) is the probability that the next bit will be a 1, P_i(1) is the probability estimated by the i'th model, and
  • stretch(x) = ln(x / (1 - x))
  • squash(x) = 1 / (1 + e^(-x)) (the inverse of stretch).
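
A minimal sketch of this prediction step (illustrative names, not PAQ's code):

  import math

  def stretch(p):
      return math.log(p / (1 - p))    # ln(p / (1 - p))

  def squash(x):
      return 1 / (1 + math.exp(-x))   # inverse of stretch

  # P(1) = squash(sum over i of w_i * stretch(P_i(1)))
  def mix_logistic(probs, weights):
      return squash(sum(w * stretch(p) for p, w in zip(probs, weights)))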


After each prediction, the model is updated by adjusting the weights to minimize coding cost:
  • w_i ← w_i + η x_i (y - P(1))


where η is the learning rate (typically 0.002 to 0.01), y is the actual bit just coded, and (y - P(1)) is the prediction error.
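
Putting the two steps together, a sketch of one predict-and-update cycle, continuing mix_logistic above (the default learning rate is illustrative, chosen within the range just given):

  def step_logistic(probs, weights, y, lr=0.005):
      xs = [stretch(p) for p in probs]                       # x_i
      p1 = squash(sum(w * x for w, x in zip(weights, xs)))   # P(1)
      for i, x in enumerate(xs):
          weights[i] += lr * x * (y - p1)                    # weight update
      return p1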

List of Context Mixing Compressors

All versions below use logistic mixing unless otherwise indicated.
  • All PAQ versions (Matt Mahoney, Serge Osnach, Alexander Ratushnyak, Przemysław Skibiński, Jan Ondrus, and others) http://mattmahoney.net/dc/paq.html. PAQAR and versions prior to PAQ7 used linear mixing. Later versions used logistic mixing.
  • All LPAQ versions (Matt Mahoney, Alexander Ratushnyak) http://mattmahoney.net/dc/#lpaq.
  • ZPAQ (Matt Mahoney) http://mattmahoney.net/dc/#zpaq.
  • WinRK 3.0.3 (Malcolm Taylor) in maximum compression PWCM mode http://www.msoftware.co.nz/. Version 3.0.2 was based on linear mixing.
  • NanoZip (Sami Runsas) in maximum compression mode (option -cc) http://www.nanozip.net/.
  • xwrt 3.2 (Przemysław Skibiński) in maximum compression mode (options -i10 through -i14) http://sourceforge.net/projects/xwrt/files/ as a back end to a dictionary encoder.
  • cmm1 through cmm4, M1, and M1X2 (Christopher Mattern) use a small number of contexts for high speed. M1 and M1X2 use a genetic algorithm to select two bit masked contexts in a separate optimization pass.
  • ccm (Christian Martelock).
  • bit (Osman Turan) http://www.osmanturan.com/bit07.zip.
  • pimple, pimple2, tc, and px (Ilia Muraviev) http://lovepimple.110mb.com/.
  • enc (Serge Osnach) tries several methods based on PPM and (linear) context mixing and chooses the best one. http://www.compression.ru/so/
  • fpaq2 (Nania Francesco Antonio) uses fixed weight averaging for high speed.