Redundancy (information theory)
Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit certain data. Data compression is a way to reduce or eliminate unwanted redundancy, while checksums are a way of adding desired redundancy for purposes of error detection when communicating over a noisy channel of limited capacity.

Quantitative definition

In describing the redundancy of raw data, recall that the rate of a source of information is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the most general case of a stochastic process, it is

$r = \lim_{n \to \infty} \frac{1}{n} H(M_1, M_2, \dots, M_n),$

the limit, as n goes to infinity, of the joint entropy of the first n symbols divided by n. It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a memoryless source is simply $H(M)$, since by definition there is no interdependence of the successive messages of a memoryless source.

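For a memoryless source the rate is just the Shannon entropy of its symbol distribution. The following is a minimal sketch in Python, assuming a hypothetical four-symbol source with made-up probabilities:

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a single-symbol distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical memoryless source over the alphabet {a, b, c, d}.
source = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

r = entropy(source.values())        # rate of a memoryless source = H(M)
print(f"rate r = {r} bits/symbol")  # 1.75 bits/symbol
```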
The absolute rate of a language or source is simply

$R = \log |M|,$

the logarithm of the cardinality of the message space, or alphabet. (This formula is sometimes called the Hartley function.) This is the maximum possible rate of information that can be transmitted with that alphabet. (The logarithm should be taken to a base appropriate for the unit of measurement in use.) The absolute rate is equal to the actual rate if the source is memoryless and has a uniform distribution.
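With base-2 logarithms (so the unit is bits), the absolute rate of the same hypothetical four-symbol alphabet works out as follows:

```python
from math import log2

alphabet = ["a", "b", "c", "d"]              # hypothetical message space
R = log2(len(alphabet))                      # absolute rate (Hartley function)
print(f"absolute rate R = {R} bits/symbol")  # 2.0 bits/symbol
```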

The absolute redundancy can then be defined as

$D = R - r,$

the difference between the absolute rate and the rate.

The quantity $D/R$ is called the relative redundancy and gives the maximum possible data compression ratio, when expressed as the percentage by which a file size can be decreased. (When expressed as a ratio of original file size to compressed file size, the quantity $R : r$ gives the maximum compression ratio that can be achieved.) Complementary to the concept of relative redundancy is efficiency, defined as $r/R$, so that $r/R + D/R = 1$. A memoryless source with a uniform distribution has zero redundancy (and thus 100% efficiency), and cannot be compressed.
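Putting these definitions together for the hypothetical source sketched above:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical four-symbol memoryless source from the earlier examples.
source = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

r = entropy(source.values())   # rate: 1.75 bits/symbol
R = log2(len(source))          # absolute rate: 2.0 bits/symbol

D = R - r                      # absolute redundancy: 0.25 bits/symbol
relative_redundancy = D / R    # 0.125 -> file size can shrink by at most 12.5%
efficiency = r / R             # 0.875; efficiency + relative redundancy = 1
max_ratio = R / r              # ~1.14 : 1, original size to compressed size
print(D, relative_redundancy, efficiency, max_ratio)
```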

Other notions of redundancy

A measure of redundancy between two variables is the mutual information or a normalized variant. A measure of redundancy among many variables is given by the total correlation.
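As a rough sketch, the mutual information of two variables can be computed directly from their joint distribution; the joint probabilities below are invented purely for illustration:

```python
from math import log2

# Hypothetical joint distribution p(x, y) of two dependent binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginal distributions p(x) and p(y).
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
print(f"I(X;Y) = {mi:.4f} bits")  # positive, since X and Y are dependent
```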

Redundancy of compressed data refers to the difference between the expected compressed data length of $n$ messages, $L(M^n)$ (or expected data rate $L(M^n)/n$), and the entropy $nr$ (or entropy rate $r$). (Here we assume the data is ergodic and stationary, e.g., a memoryless source.) Although the rate difference $L(M^n)/n - r$ can be made arbitrarily small as $n$ is increased, the actual difference $L(M^n) - nr$ cannot, although it can be theoretically upper-bounded by 1 in the case of finite-entropy memoryless sources.
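To make the bound concrete, the sketch below assumes Shannon code lengths $\lceil -\log_2 p \rceil$ for a hypothetical memoryless source; a prefix code with these lengths always exists, and its expected length exceeds the entropy by less than 1 bit per message (an optimal code, such as a Huffman code, can only do better):

```python
from math import ceil, log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical memoryless source; probabilities invented for the example.
probs = [0.45, 0.30, 0.15, 0.10]

# Shannon code lengths satisfy Kraft's inequality, so a prefix code
# with these lengths exists.
lengths = [ceil(-log2(p)) for p in probs]
expected_length = sum(p * l for p, l in zip(probs, lengths))

H = entropy(probs)
print(f"entropy          = {H:.4f} bits")                 # ~1.78 bits
print(f"expected length  = {expected_length:.4f} bits")   # 2.35 bits
print(f"redundancy L - H = {expected_length - H:.4f} bits (< 1)")
```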

See also

  • Data compression
  • Hartley function
  • Negentropy
  • Source coding theorem