Curse of dimensionality
Encyclopedia
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

There are multiple phenomena referred to by this name in domains such as sampling
Sampling (statistics)
In statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population....

, combinatorics
Combinatorics
Combinatorics is a branch of mathematics concerning the study of finite or countable discrete structures. Aspects of combinatorics include counting the structures of a given kind and size , deciding when certain criteria can be met, and constructing and analyzing objects meeting the criteria ,...

, machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

 and data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...

. The common theme of these problems is that when the dimensionality increases, the volume
Volume
Volume is the quantity of three-dimensional space enclosed by some closed boundary, for example, the space that a substance or shape occupies or contains....

 of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data you need to support the result often grows exponentially with the dimensionality. Also organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data however all objects appear to be sparse and dissimilar in many ways which prevents common data organization strategies from being efficient.

The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.

The "curse of dimensionality" is often used as a blanket excuse for not dealing with high-dimensional data. However, the effects are not yet completely understood by the scientific community, and there is ongoing research. On one hand, the notion of intrinsic dimension
Intrinsic dimension
In signal processing of multidimensional signals, for example in computer vision, the intrinsic dimension of the signal describes how many variables are needed to represent the signal. For a signal of N variables, its intrinsic dimension M satisfies 0 ≤ M ≤ N.Usually the intrinsic dimension...

 refers to the fact that any low-dimensional data space can trivially be turned into a higher dimensional space by adding redundant (e.g. duplicate) or randomized dimensions, and in turn many high-dimensional data sets can be reduced to lower dimensional data without significant information loss. This is also reflected by the effectiveness of dimension reduction methods such as principal component analysis in many situations. For distance functions and nearest neighbor search, recent research also showed that data sets that exhibit the curse of dimensionality properties can still be processed unless there are too many irrelevant dimensions, while relevant dimensions can make some problems such as cluster analysis actually easier. Secondly, methods such as Markov chain Monte Carlo
Markov chain Monte Carlo
Markov chain Monte Carlo methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the...

 or shared nearest neighbor methods often work very well on data that was considered intractable by other methods due to its high dimensionality.

Combinatorics

In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion
Combinatorial explosion
In administration and computing, a combinatorial explosion is the rapidly accelerating increase in lines of communication as organizations are added in a process...

. Even in the simplest case of binary variables, the number of possible combinations already is , exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.

Sampling

There is an exponential increase in volume
Volume
Volume is the quantity of three-dimensional space enclosed by some closed boundary, for example, the space that a substance or shape occupies or contains....

 associated with adding extra dimensions to a mathematical space. For example, 100 evenly-spaced sample points suffice to sample a unit interval
Unit interval
In mathematics, the unit interval is the closed interval , that is, the set of all real numbers that are greater than or equal to 0 and less than or equal to 1...

 with no more than 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice with a spacing of 0.01 between adjacent points would require 1020 sample points: thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 1018 "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.

Optimization

When solving dynamic optimization
Optimization (mathematics)
In mathematics, computational science, or management science, mathematical optimization refers to the selection of a best element from some set of available alternatives....

 problems by numerical backward induction
Backward induction
Backward induction is the process of reasoning backwards in time, from the end of a problem or situation, to determine a sequence of optimal actions. It proceeds by first considering the last time a decision might be made and choosing what to do in any situation at that time. Using this...

, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.

Machine learning

In machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

 problems that involve learning a "state-of-nature" (maybe an infinite distribution) from a finite number of data samples in a high-dimensional feature space
Feature space
In pattern recognition a feature space is an abstract space where each pattern sample is represented as a point in n-dimensional space. Its dimension is determined by the number of features used to describe the patterns...

 with each feature having a number of possible values, an enormous amount of training data are required to ensure that there are several samples with each combination of values. With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as the Hughes effect or Hughes phenomenon (named after Gordon F. Hughes).

Bayesian statistics

The curse of dimensionality has often been a difficulty with Bayesian statistics
Bayesian statistics
Bayesian statistics is that subset of the entire field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief or, more specifically, Bayesian probabilities...

, for which the posterior distributions often have many parameters.

However, this problem has been largely overcome by the advent of simulation-based Bayesian inference, especially using Markov chain Monte Carlo
Markov chain Monte Carlo
Markov chain Monte Carlo methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the...

 methods, which suffices for many practical problems. Of course, simulation-based methods converge slowly and therefore simulation-based methods are not a panacea for high-dimensional problems.

Distance functions

When a measure such as a Euclidean distance
Euclidean distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...

 is defined using many coordinates, there is little difference in the distances between different pairs of samples.

One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of a hypersphere
Hypersphere
In mathematics, an n-sphere is a generalization of the surface of an ordinary sphere to arbitrary dimension. For any natural number n, an n-sphere of radius r is defined as the set of points in -dimensional Euclidean space which are at distance r from a central point, where the radius r may be any...

 with radius and dimension , to that of a hypercube
Hypercube
In geometry, a hypercube is an n-dimensional analogue of a square and a cube . It is a closed, compact, convex figure whose 1-skeleton consists of groups of opposite parallel line segments aligned in each of the space's dimensions, perpendicular to each other and of the same length.An...

 with sides of length , and equivalent dimension.

The volume of such a sphere is:



The volume of the cube would be:



As the dimension of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension goes to infinity:

.

as .

Thus, in some sense, nearly all of the high-dimensional space is "far away" from the centre, or, to put it another way, the high-dimensional unit space can be said to consist almost entirely of the "corners" of the hypercube, with almost no "middle". This is an important intuition for understanding the chi-squared distribution.

Given a single distribution, the minimum and the maximum occurring distances become indiscernible as the difference between the minimum and maximum value compared to the minimum value converges to 0:

.

This is often cited as distance functions losing their usefulness in high dimensionality.

Nearest neighbor search

The effect complicates nearest neighbor search
Nearest neighbor search
Nearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...

 in high dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.

However, recent research indicates that the mere number of dimensions does not necessarily result in problems, since relevant additional dimensions can also increase the contrast. In addition, the resulting ranking remains useful to discern close and far neighbors. Irrelevant ("noise") dimensions however reduce the contrast as expected. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliable as long as the signal-to-noise ratio
Signal-to-noise ratio
Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power. A ratio higher than 1:1 indicates more signal than noise...

 is high enough.

k-nearest neighbor classification

Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs
Graph (mathematics)
In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...

 constructed from a data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

 using some distance functions. As dimensionality increases, the indegree distribution of the k-NN digraph
Directed graph
A directed graph or digraph is a pair G= of:* a set V, whose elements are called vertices or nodes,...

 becomes skewed
Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive or negative, or even undefined...

 to the right, resulting in the emergence of hubs, as data instances that appear in many more k-NN lists of other instances from the data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

 than expected. This phenomenon can have a considerable impact on various techniques for classification (including the k-NN classifier
K-nearest neighbor algorithm
In pattern recognition, the k-nearest neighbor algorithm is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until...

), semi-supervised learning
Semi-supervised learning
In computer science, semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data...

, and clustering, and it also affects information retrieval
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

.

See also

  • Combinatorial explosion
    Combinatorial explosion
    In administration and computing, a combinatorial explosion is the rapidly accelerating increase in lines of communication as organizations are added in a process...

  • Concentration of measure
    Concentration of measure
    In mathematics, concentration of measure is a principle that is applied in measure theory, probability and combinatorics, and has consequences for other fields such as Banach space theory. Informally, it states that "A random variable that depends in a Lipschitz way on many independent variables ...

  • High-dimensional space
  • Dimension reduction
  • Fourier-related transforms
  • Clustering high-dimensional data
    Clustering high-dimensional data
    Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and...


  • Dynamic programming
    Dynamic programming
    In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

  • Bellman equation
    Bellman equation
    A Bellman equation , named after its discoverer, Richard Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming...

  • Backwards induction
  • Principal component analysis
  • Linear least squares
    Linear least squares
    In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model...


  • Quasi-random
  • Cluster analysis
  • Wavelet
    Wavelet
    A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to have...

  • Time series
    Time series
    In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...

  • Singular value decomposition
    Singular value decomposition
    In linear algebra, the singular value decomposition is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics....


Further reading

  • Powell, Warren B. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, ISBN 0470171553.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK