Curse of dimensionality - AbsoluteAstronomy.com

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

There are multiple phenomena referred to by this name in domains such as sampling

Sampling (statistics)

In statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population....

, combinatorics

Combinatorics

Combinatorics is a branch of mathematics concerning the study of finite or countable discrete structures. Aspects of combinatorics include counting the structures of a given kind and size , deciding when certain criteria can be met, and constructing and analyzing objects meeting the criteria ,...

, machine learning

Machine learning

Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

and data mining

Data mining

Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...

. The common theme of these problems is that when the dimensionality increases, the volume

Volume

Volume is the quantity of three-dimensional space enclosed by some closed boundary, for example, the space that a substance or shape occupies or contains....

of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data you need to support the result often grows exponentially with the dimensionality. Also organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data however all objects appear to be sparse and dissimilar in many ways which prevents common data organization strategies from being efficient.

The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.

The "curse of dimensionality" is often used as a blanket excuse for not dealing with high-dimensional data. However, the effects are not yet completely understood by the scientific community, and there is ongoing research. On one hand, the notion of intrinsic dimension

Intrinsic dimension

In signal processing of multidimensional signals, for example in computer vision, the intrinsic dimension of the signal describes how many variables are needed to represent the signal. For a signal of N variables, its intrinsic dimension M satisfies 0 ≤ M ≤ N.Usually the intrinsic dimension...

refers to the fact that any low-dimensional data space can trivially be turned into a higher dimensional space by adding redundant (e.g. duplicate) or randomized dimensions, and in turn many high-dimensional data sets can be reduced to lower dimensional data without significant information loss. This is also reflected by the effectiveness of dimension reduction methods such as principal component analysis in many situations. For distance functions and nearest neighbor search, recent research also showed that data sets that exhibit the curse of dimensionality properties can still be processed unless there are too many irrelevant dimensions, while relevant dimensions can make some problems such as cluster analysis actually easier. Secondly, methods such as Markov chain Monte Carlo

Markov chain Monte Carlo

Markov chain Monte Carlo methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the...

or shared nearest neighbor methods often work very well on data that was considered intractable by other methods due to its high dimensionality.

Combinatorics

In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion

Combinatorial explosion

In administration and computing, a combinatorial explosion is the rapidly accelerating increase in lines of communication as organizations are added in a process...

. Even in the simplest case of binary variables, the number of possible combinations already is

, exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.

Sampling

There is an exponential increase in volume

Volume

Volume is the quantity of three-dimensional space enclosed by some closed boundary, for example, the space that a substance or shape occupies or contains....

associated with adding extra dimensions to a mathematical space. For example, 100 evenly-spaced sample points suffice to sample a unit interval

Unit interval

In mathematics, the unit interval is the closed interval , that is, the set of all real numbers that are greater than or equal to 0 and less than or equal to 1...

with no more than 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice with a spacing of 0.01 between adjacent points would require 10²⁰ sample points: thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 10¹⁸ "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.

Optimization

When solving dynamic optimization

Optimization (mathematics)

In mathematics, computational science, or management science, mathematical optimization refers to the selection of a best element from some set of available alternatives....

problems by numerical backward induction

Backward induction

Backward induction is the process of reasoning backwards in time, from the end of a problem or situation, to determine a sequence of optimal actions. It proceeds by first considering the last time a decision might be made and choosing what to do in any situation at that time. Using this...

, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.

Machine learning

In machine learning

Machine learning

problems that involve learning a "state-of-nature" (maybe an infinite distribution) from a finite number of data samples in a high-dimensional feature space

Feature space

In pattern recognition a feature space is an abstract space where each pattern sample is represented as a point in n-dimensional space. Its dimension is determined by the number of features used to describe the patterns...

with each feature having a number of possible values, an enormous amount of training data are required to ensure that there are several samples with each combination of values. With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as the Hughes effect or Hughes phenomenon (named after Gordon F. Hughes).

Bayesian statistics

The curse of dimensionality has often been a difficulty with Bayesian statistics

Bayesian statistics

Bayesian statistics is that subset of the entire field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief or, more specifically, Bayesian probabilities...

, for which the posterior distributions often have many parameters.

However, this problem has been largely overcome by the advent of simulation-based Bayesian inference, especially using Markov chain Monte Carlo

Markov chain Monte Carlo

methods, which suffices for many practical problems. Of course, simulation-based methods converge slowly and therefore simulation-based methods are not a panacea for high-dimensional problems.

Distance functions

When a measure such as a Euclidean distance

Euclidean distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...

is defined using many coordinates, there is little difference in the distances between different pairs of samples.

One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of a hypersphere

Hypersphere

In mathematics, an n-sphere is a generalization of the surface of an ordinary sphere to arbitrary dimension. For any natural number n, an n-sphere of radius r is defined as the set of points in -dimensional Euclidean space which are at distance r from a central point, where the radius r may be any...

with radius

and dimension

, to that of a hypercube

Hypercube

In geometry, a hypercube is an n-dimensional analogue of a square and a cube . It is a closed, compact, convex figure whose 1-skeleton consists of groups of opposite parallel line segments aligned in each of the space's dimensions, perpendicular to each other and of the same length.An...

with sides of length

, and equivalent dimension.

The volume of such a sphere is:

The volume of the cube would be:

As the dimension

of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension

goes to infinity:

.

as

.

Thus, in some sense, nearly all of the high-dimensional space is "far away" from the centre, or, to put it another way, the high-dimensional unit space can be said to consist almost entirely of the "corners" of the hypercube, with almost no "middle". This is an important intuition for understanding the chi-squared distribution.

Given a single distribution, the minimum and the maximum occurring distances become indiscernible as the difference between the minimum and maximum value compared to the minimum value converges to 0:

.

This is often cited as distance functions losing their usefulness in high dimensionality.

Nearest neighbor search

The effect complicates nearest neighbor search

Nearest neighbor search

Nearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...

in high dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.

However, recent research indicates that the mere number of dimensions does not necessarily result in problems, since relevant additional dimensions can also increase the contrast. In addition, the resulting ranking remains useful to discern close and far neighbors. Irrelevant ("noise") dimensions however reduce the contrast as expected. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliable as long as the signal-to-noise ratio

Signal-to-noise ratio

Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power. A ratio higher than 1:1 indicates more signal than noise...

is high enough.

k-nearest neighbor classification

Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs

Graph (mathematics)

In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...

constructed from a data set

Data set

A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

using some distance functions. As dimensionality increases, the indegree distribution of the k-NN digraph

Directed graph

A directed graph or digraph is a pair G= of:* a set V, whose elements are called vertices or nodes,...

becomes skewed

Skewness

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive or negative, or even undefined...

to the right, resulting in the emergence of hubs, as data instances that appear in many more k-NN lists of other instances from the data set

Data set

than expected. This phenomenon can have a considerable impact on various techniques for classification (including the k-NN classifier

K-nearest neighbor algorithm

In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

), semi-supervised learning

Semi-supervised learning

In computer science, semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data...

, and clustering, and it also affects information retrieval

Information retrieval

Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

Combinatorics

Sampling

Optimization

Machine learning

Bayesian statistics

Distance functions

Nearest neighbor search

k-nearest neighbor classification

See also

Further reading