Least squares support vector machine
Least squares support vector machines (LS-SVM) are least squares versions of support vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least squares SVM classifiers were proposed by Suykens and Vandewalle. LS-SVMs are a class of kernel-based learning methods.

From support vector machine to least squares support vector machine

Given a training set {(x_i, y_i)}, i = 1, …, N, with input data x_i ∈ R^n and corresponding binary class labels y_i ∈ {−1, +1}, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:

    w^T φ(x_i) + b ≥ +1,  if y_i = +1,
    w^T φ(x_i) + b ≤ −1,  if y_i = −1,

which is equivalent to

    y_i [w^T φ(x_i) + b] ≥ 1,  i = 1, …, N,

where φ(x) is the nonlinear map from the original space to the high- (and possibly infinite-) dimensional space.

Inseparable data

In case such a separating hyperplane does not exist, we introduce so-called slack variables ξ_i such that

    y_i [w^T φ(x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N.
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:

    min_{w, b, ξ}  J(w, ξ) = (1/2) w^T w + c Σ_i ξ_i,

    subject to  y_i [w^T φ(x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N.

This constrained problem is handled through its Lagrangian

    L(w, b, ξ; α, β) = (1/2) w^T w + c Σ_i ξ_i − Σ_i α_i { y_i [w^T φ(x_i) + b] − 1 + ξ_i } − Σ_i β_i ξ_i,

where α_i ≥ 0 and β_i ≥ 0 are the Lagrange multipliers. The optimal point will be in the saddle point of the Lagrangian function, and then we obtain

    w = Σ_i α_i y_i φ(x_i),   Σ_i α_i y_i = 0,   0 ≤ α_i ≤ c.

By substituting w by its expression, we will get the following quadratic programming problem:

    max_α  Q(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j),

    subject to  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ c,

where K(x_i, x_j) = φ(x_i)^T φ(x_j) is called the kernel function. Solving this QP problem subject to these constraints, we will get the hyperplane in the high-dimensional space and hence the classifier in the original space:

    y(x) = sign( Σ_i α_i y_i K(x, x_i) + b ).

Least squares SVM formulation

The least squares version of the SVM classifier is obtained by reformulating the minimization problem as

    min_{w, b, e}  J(w, e) = (1/2) w^T w + (γ/2) Σ_i e_i²,

subject to the equality constraints

    y_i [w^T φ(x_i) + b] = 1 − e_i,  i = 1, …, N.
The least squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y_i = ±1.

Using y_i² = 1, we have

    Σ_i e_i² = Σ_i (y_i e_i)² = Σ_i (y_i − (w^T φ(x_i) + b))²,

with y_i e_i = y_i − (w^T φ(x_i) + b). Hence the LS-SVM classifier formulation is equivalent to

    J(w, b) = μ E_W + ζ E_D,

with E_W = (1/2) w^T w and E_D = (1/2) Σ_i (y_i − (w^T φ(x_i) + b))².
Both μ and ζ should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution depends only on the ratio γ = ζ/μ; therefore the original formulation uses only γ as tuning parameter. We use both μ and ζ as parameters in order to provide a Bayesian interpretation of LS-SVM.
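The ratio property can be checked numerically. The following NumPy sketch (illustrative code, not from the original article; function name and test data are assumptions) minimizes J(w, b) = (μ/2) w^T w + (ζ/2) Σ_i (y_i − w^T x_i − b)² for a linear feature map by solving the normal equations of this quadratic objective, and confirms that scaling μ and ζ by the same factor leaves the minimizer unchanged:

```python
import numpy as np

def lssvm_primal_linear(X, y, mu, zeta):
    # Minimize J(w, b) = mu/2 * ||w||^2 + zeta/2 * sum_i (y_i - w.x_i - b)^2.
    # The objective is quadratic, so the minimizer solves a linear system
    # (the stationarity conditions dJ/dw = 0 and dJ/db = 0).
    N, n = X.shape
    ones = np.ones(N)
    A = np.block([[mu * np.eye(n) + zeta * X.T @ X, zeta * (X.T @ ones)[:, None]],
                  [zeta * (ones @ X)[None, :],      np.array([[zeta * N]])]])
    rhs = np.concatenate([zeta * X.T @ y, [zeta * ones @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]                       # w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(rng.normal(size=20))

w1, b1 = lssvm_primal_linear(X, y, mu=1.0, zeta=5.0)
w2, b2 = lssvm_primal_linear(X, y, mu=10.0, zeta=50.0)   # same ratio gamma = 5
assert np.allclose(w1, w2) and np.isclose(b1, b2)        # only gamma matters
```

Scaling (μ, ζ) to (10μ, 10ζ) multiplies both the system matrix and the right-hand side by 10, so the solution is identical; only the ratio γ = ζ/μ affects it.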

The solution of the LS-SVM classifier will be obtained after we construct the Lagrangian function:

    L(w, b, e; α) = J(w, e) − Σ_i α_i { y_i [w^T φ(x_i) + b] − 1 + e_i },

where α_i ∈ R are the Lagrange multipliers. The conditions for optimality are

    ∂L/∂w = 0  →  w = Σ_i α_i y_i φ(x_i),
    ∂L/∂b = 0  →  Σ_i α_i y_i = 0,
    ∂L/∂e_i = 0  →  α_i = γ e_i,  i = 1, …, N,
    ∂L/∂α_i = 0  →  y_i [w^T φ(x_i) + b] − 1 + e_i = 0,  i = 1, …, N.
Elimination of w and e will yield a linear system instead of a quadratic programming problem:

    [ 0      y^T       ] [ b ]   [ 0   ]
    [ y   Ω + I_N / γ  ] [ α ] = [ 1_N ]

with y = (y_1, …, y_N)^T, 1_N = (1, …, 1)^T and α = (α_1, …, α_N)^T. Here, I_N is an N × N identity matrix, and Ω ∈ R^{N×N} is the kernel matrix defined by Ω_ij = y_i y_j φ(x_i)^T φ(x_j) = y_i y_j K(x_i, x_j).
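This linear system translates directly into code. The following is a minimal NumPy sketch (an illustration, not the LS-SVMlab implementation; function names and the toy data are assumptions) that builds the system for an RBF kernel, solves it for b and α, and classifies new points with y(x) = sign(Σ_i α_i y_i K(x, x_i) + b):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||A[i] - B[j]||^2 / sigma^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    # Build and solve the (N+1) x (N+1) LS-SVM system
    #   [0    y^T            ] [b    ]   [0  ]
    #   [y    Omega + I/gamma] [alpha] = [1_N]
    # with Omega_ij = y_i y_j K(x_i, x_j).
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]                       # b, alpha

def lssvm_predict(X_train, y_train, alpha, b, X_new, sigma=1.0):
    # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)

# Toy data: two well-separated clusters
X = np.array([[0., 0.], [0., 1.], [1., 0.], [3., 3.], [3., 4.], [4., 3.]])
y = np.array([-1., -1., -1., 1., 1., 1.])
b, alpha = lssvm_train(X, y)
assert abs(alpha @ y) < 1e-8                     # optimality: sum_i alpha_i y_i = 0
assert np.array_equal(lssvm_predict(X, y, alpha, b, X), y)
```

Note that, unlike the classical SVM, every α_i is in general nonzero (α_i = γ e_i), so the LS-SVM solution is not sparse.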

Kernel function K

For the kernel function K(·, ·) one typically has the following choices:
  • Linear kernel: K(x, x_i) = x_i^T x,
  • Polynomial kernel of degree d: K(x, x_i) = (1 + x_i^T x / c)^d,
  • Radial basis function (RBF) kernel: K(x, x_i) = exp(−||x − x_i||² / σ²),
  • MLP kernel: K(x, x_i) = tanh(k x_i^T x + θ),

where d, c, σ, k and θ are constants. Notice that the Mercer condition holds for all c and σ values in the polynomial and RBF case, but not for all possible choices of k and θ in the MLP case. The scale parameters c, σ and k determine the scaling of the inputs in the polynomial, RBF and MLP kernel functions. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
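The four kernels can be written down directly. The NumPy sketch below (illustrative code with assumed default parameter values) implements each one and builds a Gram matrix G_ij = K(x_i, x_j); for the RBF kernel, symmetry and positive semi-definiteness of G reflect the Mercer condition:

```python
import numpy as np

def linear_kernel(x, xi):
    return xi @ x

def poly_kernel(x, xi, d=3, c=1.0):
    return (1.0 + xi @ x / c) ** d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def mlp_kernel(x, xi, k=1.0, theta=0.0):
    # Satisfies the Mercer condition only for some (k, theta) choices
    return np.tanh(k * (xi @ x) + theta)

def gram_matrix(kernel, X, **params):
    # G[i, j] = K(x_i, x_j)
    N = len(X)
    return np.array([[kernel(X[i], X[j], **params) for j in range(N)]
                     for i in range(N)])

X = np.random.default_rng(0).normal(size=(5, 2))
G = gram_matrix(rbf_kernel, X, sigma=1.5)
assert np.allclose(G, G.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(G) > -1e-10)    # positive semi-definite (Mercer)
assert np.allclose(np.diag(G), 1.0)              # K(x, x) = 1 for the RBF kernel
```

Decreasing σ makes G approach the identity (each point similar only to itself), while increasing σ makes all entries approach 1; this is the bandwidth effect described above.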

Bayesian interpretation for LS-SVM

A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as P[f] ∝ exp(−β ||P̂ f||²). Here β is a constant and P̂ is the regularization operator corresponding to the selected kernel.

A general Bayesian evidence framework was developed by MacKay, who applied it to regression, feed-forward neural networks and classification networks. Provided a data set D, a model M with parameter vector w and a so-called hyperparameter or regularization parameter λ, Bayesian inference is constructed with 3 levels of inference:
  • In level 1, for a given value of λ, the first level of inference infers the posterior distribution of w by the Bayesian rule p(w | D, λ, M) ∝ p(D | w, λ, M) p(w | λ, M).
  • The second level of inference determines the value of λ by maximizing p(D | λ, M).
  • The third level of inference in the evidence framework ranks different models by examining their posterior probabilities p(M | D) ∝ p(D | M) p(M).

We can see that the Bayesian evidence framework is a unified theory for learning the model and model selection.
Kwok used the Bayesian evidence framework to interpret the formulation of the SVM and model selection, and he also applied it to support vector regression.

Now, given the data points D and the hyperparameters μ and ζ of the model M, the model parameters w and b are estimated by maximizing the posterior p(w, b | D, log μ, log ζ, M). Applying Bayes' rule, we obtain:

    p(w, b | D, log μ, log ζ, M) = p(D | w, b, log μ, log ζ, M) p(w, b | log μ, log ζ, M) / p(D | log μ, log ζ, M),

where p(D | log μ, log ζ, M) is a normalizing constant such that the integral over all possible w and b is equal to 1.
We assume w and b are independent of the hyperparameter ζ, and are conditionally independent, i.e., we assume

    p(w, b | log μ, log ζ, M) = p(w | log μ, M) p(b | log σ_b, M).

When σ_b → ∞, the distribution of b will approximate a uniform distribution. Furthermore, we assume w and b follow Gaussian distributions, so we obtain the a priori distribution of w and b with σ_b → ∞ to be:

    p(w, b) = (μ / 2π)^{n_f / 2} exp(−(μ/2) w^T w) · (1 / √(2π σ_b²)) exp(−b² / (2 σ_b²))
            ∝ (μ / 2π)^{n_f / 2} exp(−(μ/2) w^T w).

Here n_f is the dimensionality of the feature space, the same as the dimensionality of w.

The probability of the data, p(D | w, b, ζ, M), is assumed to depend only on w, b, ζ and M. We assume that the data points are independently identically distributed (i.i.d.), so that:

    p(D | w, b, ζ, M) = ∏_{i=1}^N p(x_i, y_i | w, b, ζ, M).

In order to obtain the least squares cost function, it is assumed that the probability of a data point is proportional to:

    p(x_i, y_i | w, b, ζ, M) ∝ p(e_i | w, b, ζ, M).

A Gaussian distribution is taken for the errors e_i = 1 − y_i [w^T φ(x_i) + b] as:

    p(e_i | w, b, ζ, M) = √(ζ / 2π) exp(−ζ e_i² / 2).

It is assumed that w and b are determined in such a way that the class centers m̂_− and m̂_+ are mapped onto the targets −1 and +1, respectively. The projections w^T φ(x) + b of the class elements follow a multivariate Gaussian distribution with variance 1/ζ.

Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes

    p(w, b | D, log μ, log ζ, M) ∝ exp(−(μ/2) w^T w − (ζ/2) Σ_i e_i²) = exp(−J(w, b)).

The maximum posterior density estimates w_MP and b_MP are then obtained by minimizing the negative logarithm of this expression, so we arrive at the LS-SVM cost function J(w, b) = μ E_W + ζ E_D given above.
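As a numerical check of this correspondence, the NumPy sketch below (illustrative code, not from the article; variable names and toy data are assumptions) computes the MAP estimate by minimizing (μ/2) w^T w + (ζ/2) Σ_i e_i² directly for a linear kernel, then recovers (w, b) from the LS-SVM dual linear system with γ = ζ/μ; the two routes coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2)) + np.where(np.arange(15)[:, None] < 8, 1.5, -1.5)
y = np.where(np.arange(15) < 8, 1.0, -1.0)
mu, zeta = 0.5, 5.0
gamma = zeta / mu
N, n = X.shape

# MAP estimate: minimize (mu/2)||w||^2 + (zeta/2) sum_i e_i^2 with
# e_i = 1 - y_i (w.x_i + b); the objective is quadratic, so its
# stationarity conditions form a linear system in (w, b).
Z = y[:, None] * X                               # row i is y_i * x_i
A = np.block([[mu * np.eye(n) + zeta * Z.T @ Z, zeta * (Z.T @ y)[:, None]],
              [zeta * (y @ Z)[None, :],         np.array([[zeta * (y @ y)]])]])
rhs = np.concatenate([zeta * Z.T @ np.ones(N), [zeta * (y @ np.ones(N))]])
sol = np.linalg.solve(A, rhs)
w_map, b_map = sol[:n], sol[n]

# Dual route: the LS-SVM linear system with gamma = zeta/mu, linear kernel
Omega = np.outer(y, y) * (X @ X.T)
M = np.zeros((N + 1, N + 1))
M[0, 1:] = y
M[1:, 0] = y
M[1:, 1:] = Omega + np.eye(N) / gamma
dual = np.linalg.solve(M, np.concatenate(([0.0], np.ones(N))))
b_dual, alpha = dual[0], dual[1:]
w_dual = Z.T @ alpha                             # w = sum_i alpha_i y_i x_i

assert np.allclose(w_map, w_dual) and np.isclose(b_map, b_dual)
```

Both routes solve the stationarity conditions of the same strictly convex quadratic, which is why the primal MAP estimate and the dual linear-system solution agree exactly.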

External links

  • www.esat.kuleuven.be/sista/lssvmlab/ "Least squares support vector machine Lab (LS-SVMlab) toolbox contains Matlab/C implementations for a number of LS-SVM algorithms."
  • www.kernel-machines.org "Support Vector Machines and Kernel based methods (Smola & Schölkopf)."
  • www.gaussianprocess.org "Gaussian Processes: Data modeling using Gaussian Process priors over functions for regression and classification (MacKay, Williams)"
  • www.support-vector.net "Support Vector Machines and kernel based methods (Cristianini)"
  • dlib: Contains a least-squares SVM implementation for large-scale datasets.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 