Large margin nearest neighbor
Encyclopedia
Large margin nearest neighbor (LMNN) classification is a statistical machine learning
algorithm
. It learns a Pseudometric
designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming
, a sub-class of convex optimization.
The goal of supervised learning
(more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric
. Large Margin Nearest Neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule.
under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out
error (a special case of cross validation) is minimized. Let the training data consist of a data set , where the set of possible class categories is .
The algorithm learns a pseudometric
of the type.
For to be well defined, the matrix needs to be positive semi-definite. The Euclidean metric is a special case, where is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis Metric.
Figure 1 illustrates the effect of the metric under varying . The two circles show the set of points with equal distance to the center . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid.
The algorithm distinguishes between two types of special data points: target neighbors and impostors.
. The objective is twofold: For every data point , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the nearest neighbor rule.
The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors.
The second goal is achieved by constraining impostors to be one unit further away than target neighbors (and therefore pushing them out of the local neighborhood of ). The resulting inequality constraint can be stated as:
The margin of exactly one unit fixes the scale of the matrix . Any alternative choice would result in a rescaling of by a factor of .
The final optimization problem becomes:
Here the slack variables absorb the amount of violations of the impostor constraints. Their overall sum is minimized. The last constraint ensures that is positive semi-definite. The optimization problem is an instance of semidefinite programming
(SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime. A particularly well suited solver technique is the working set
method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness.
This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source
Matlab
implementation is freely available at the authors web page.
Torresani and Lee,
use the Kernel trick
to indirectly incorporate non-linear feature transformations and solve LMNN in an inner product space
.
Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...
algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
. It learns a Pseudometric
Pseudometric
Pseudometric may refer to:* Pseudo-Riemannian manifold* Pseudometric space...
designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming
Semidefinite programming
Semidefinite programming is a subfield of convex optimization concerned with the optimization of a linear objective functionover the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron....
, a sub-class of convex optimization.
The goal of supervised learning
Supervised learning
Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value...
(more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric
Metric (mathematics)
In mathematics, a metric or distance function is a function which defines a distance between elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set but not all topologies can be generated by a metric...
. Large Margin Nearest Neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule.
Setup
The main intuition behind LMNN is to learn a pseudometricPseudometric
Pseudometric may refer to:* Pseudo-Riemannian manifold* Pseudometric space...
under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out
Leave-one-out
The simplest, and a commonly used method of crossvalidation in chemometrics is the “leave-one-out” method. The idea behind this method is to predict the property value for a compound from the data set, which is in turn predicted from the regression equation calculated from the data for all other...
error (a special case of cross validation) is minimized. Let the training data consist of a data set , where the set of possible class categories is .
The algorithm learns a pseudometric
Pseudometric
Pseudometric may refer to:* Pseudo-Riemannian manifold* Pseudometric space...
of the type.
For to be well defined, the matrix needs to be positive semi-definite. The Euclidean metric is a special case, where is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis Metric.
Figure 1 illustrates the effect of the metric under varying . The two circles show the set of points with equal distance to the center . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid.
The algorithm distinguishes between two types of special data points: target neighbors and impostors.
Target Neighbors
Target neighbors are selected before learning. Each instance has exactly different target neighbors within , which all share the same class label . The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point as .Impostors
An impostor of a data point is another data point with a different class label (i.e. ) which is one of the nearest neighbors of . During learning the algorithm tries to minimize the number of impostors for all data instances in the training set.Algorithm
Large Margin Nearest Neighbors optimizes the matrix with the help of semidefinite programmingSemidefinite programming
Semidefinite programming is a subfield of convex optimization concerned with the optimization of a linear objective functionover the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron....
. The objective is twofold: For every data point , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the nearest neighbor rule.
The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors.
The second goal is achieved by constraining impostors to be one unit further away than target neighbors (and therefore pushing them out of the local neighborhood of ). The resulting inequality constraint can be stated as:
The margin of exactly one unit fixes the scale of the matrix . Any alternative choice would result in a rescaling of by a factor of .
The final optimization problem becomes:
Here the slack variables absorb the amount of violations of the impostor constraints. Their overall sum is minimized. The last constraint ensures that is positive semi-definite. The optimization problem is an instance of semidefinite programming
Semidefinite programming
Semidefinite programming is a subfield of convex optimization concerned with the optimization of a linear objective functionover the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron....
(SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime. A particularly well suited solver technique is the working set
Working set
Peter Denning defines “the working set of information W of a process at time t to be the collection of information referenced by the process during the process time interval ”. Typically the units of information in question are considered to be memory pages...
method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness.
Extensions and efficient solvers
LMNN was extended to multiple local metrics in the 2008 paper.This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source
Open source
The term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...
Matlab
MATLAB
MATLAB is a numerical computing environment and fourth-generation programming language. Developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages,...
implementation is freely available at the authors web page.
Torresani and Lee,
use the Kernel trick
Kernel trick
For machine learning algorithms, the kernel trick is a way of mapping observations from a general set S into an inner product space V , without ever having to compute the mapping explicitly, in the hope that the observations will gain meaningful linear structure in V...
to indirectly incorporate non-linear feature transformations and solve LMNN in an inner product space
Inner product space
In mathematics, an inner product space is a vector space with an additional structure called an inner product. This additional structure associates each pair of vectors in the space with a scalar quantity known as the inner product of the vectors...
.
Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.
See also
- Linear discriminant analysisLinear discriminant analysisLinear discriminant analysis and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events...
- Learning Vector Quantization
- PseudometricPseudometricPseudometric may refer to:* Pseudo-Riemannian manifold* Pseudometric space...
- Nearest neighbor searchNearest neighbor searchNearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...
- Cluster analysis
- Data classificationData classificationIn the field of data management, data classification as a part of Information Lifecycle Management process can be defined as tool for categorization of data to enable/help organization to effectively answer following questions:...
- Data miningData miningData mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
- Machine learningMachine learningMachine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...
- Pattern recognitionPattern recognitionIn machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...
- Predictive analyticsPredictive analyticsPredictive analytics encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events....
- Dimension reduction