# Support vector machine

Overview
A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

## Formal definition

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $K(\mathbf{x}, \mathbf{y})$ selected to suit the problem.

The hyperplanes in the higher-dimensional space are defined as the set of points whose inner product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters $\alpha_i$, of images of feature vectors $\mathbf{x}_i$ that occur in the database. With this choice of a hyperplane, the points $\mathbf{x}$ in the feature space that are mapped into the hyperplane are defined by the relation:

$$\sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) = \text{constant}.$$

Note that if $K(\mathbf{x}, \mathbf{y})$ becomes small as $\mathbf{y}$ grows further from $\mathbf{x}$, each term in the sum measures the degree of closeness of the test point $\mathbf{x}$ to the corresponding database point $\mathbf{x}_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $\mathbf{x}$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.
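
The idea that a weighted sum of kernels measures relative nearness to the two sets can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the Gaussian form of the kernel, the `gamma` value, the four database points, and the choice of positive weights for one set and negative weights for the other.

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); shrinks as y moves away from x."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_score(x, points, alphas, kernel=gaussian_kernel):
    """Weighted kernel sum: sum_i alpha_i * K(x_i, x)."""
    return sum(a * kernel(p, x) for a, p in zip(alphas, points))

# Hypothetical database: two points from set A (positive weights)
# and two points from set B (negative weights).
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
alphas = [1.0, 1.0, -1.0, -1.0]

near_a = kernel_score((0.2, 0.5), points, alphas)  # test point close to set A
near_b = kernel_score((3.1, 3.4), points, alphas)  # test point close to set B
print(near_a > 0, near_b < 0)
```

With these hand-picked values the score is positive near set A and negative near set B, because the far-away points contribute almost nothing to the sum.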

## History

The original SVM algorithm was invented by Vladimir Vapnik, and the current standard incarnation (soft margin) was proposed by Vapnik and Corinna Cortes in 1995.

## Motivation

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum margin classifier or, equivalently, the perceptron of optimal stability.

## Linear SVM

We are given some training data $\mathcal{D}$, a set of $n$ points of the form

$$\mathcal{D} = \left\{ (\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\} \right\}_{i=1}^{n}$$

where $y_i$ is either 1 or −1, indicating the class to which the point $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a p-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having $y_i = 1$ from those having $y_i = -1$. Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0,$$

where $\cdot$ denotes the dot product and $\mathbf{w}$ the normal vector to the hyperplane. The parameter $b / \|\mathbf{w}\|$ determines the offset of the hyperplane from the origin along the normal vector $\mathbf{w}$.

We want to choose $\mathbf{w}$ and $b$ to maximize the margin, or distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1$$

and

$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$

Note that if the training data are linearly separable, we can select the two hyperplanes of the margin so that there are no points between them and then try to maximize their distance. By using geometry, we find the distance between these two hyperplanes is $2 / \|\mathbf{w}\|$, so we want to minimize $\|\mathbf{w}\|$. As we also have to prevent data points from falling into the margin, we add the following constraint: for each $i$ either

$$\mathbf{w} \cdot \mathbf{x}_i - b \ge 1 \quad \text{for } \mathbf{x}_i \text{ of the first class}$$

or

$$\mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{for } \mathbf{x}_i \text{ of the second.}$$

This can be rewritten as:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad \text{for all } 1 \le i \le n. \qquad (1)$$

We can put this together to get the optimization problem:

Minimize (in $\mathbf{w}, b$)

$$\|\mathbf{w}\|$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1.$$
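
To make the trade-off concrete, the following sketch compares two candidate hyperplanes of the form w·x − b = 0 on a toy data set. The data points and both candidate (w, b) pairs are hypothetical values chosen by hand; the code only checks the constraint y_i(w·x_i − b) ≥ 1 and computes the margin width 2/||w||, it does not solve the optimization problem.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Distance between the hyperplanes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, data):
    """True if y_i * (w.x_i - b) >= 1 holds for every training point."""
    return all(y * (dot(w, x) - b) >= 1.0 for x, y in data)

# Hypothetical linearly separable toy data: (point, label) pairs.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hand-picked candidates, each given as (w, b).
wide = ((0.5, 0.5), 1.0)
narrow = ((1.0, 1.0), 2.0)

for name, (w, b) in (("wide", wide), ("narrow", narrow)):
    print(name, satisfies_constraints(w, b, data), round(margin_width(w), 3))
```

Both candidates satisfy every constraint, but the one with the smaller norm of w has the wider margin, which is why the problem is posed as minimizing ||w||.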

### Primal form

The optimization problem presented in the preceding section is difficult to solve because it depends on $\|\mathbf{w}\|$, the norm of $\mathbf{w}$, which involves a square root. Fortunately, it is possible to alter the problem by substituting $\|\mathbf{w}\|$ with $\tfrac{1}{2}\|\mathbf{w}\|^2$ (the factor of 1/2 being used for mathematical convenience) without changing the solution (the minima of the original and the modified problems have the same $\mathbf{w}$ and $b$). This is a quadratic programming optimization problem. More clearly:

Minimize (in $\mathbf{w}, b$)

$$\frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1.$$

One could be tempted to express the previous problem by means of non-negative Lagrange multipliers $\alpha_i$ as

$$\min_{\mathbf{w}, b, \boldsymbol{\alpha}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right] \right\}$$

but this would be wrong. The reason is the following: suppose we can find a family of hyperplanes which divide the points; then all $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \ge 0$. Hence we could find the minimum by sending all $\alpha_i$ to $+\infty$, and this minimum would be reached for all the members of the family, not only for the best one, which can be chosen by solving the original problem.

Nevertheless, the previous constrained problem can be expressed as

$$\min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \ge 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right] \right\}$$

that is, we look for a saddle point. In doing so, all the points which can be separated as $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 > 0$ do not matter, since we must set the corresponding $\alpha_i$ to zero.

This problem can now be solved by standard quadratic programming techniques and programs. The solution can be expressed as a linear combination of the training vectors:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i.$$

Only a few $\alpha_i$ will be greater than zero. The corresponding $\mathbf{x}_i$ are exactly the support vectors, which lie on the margin and satisfy $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) = 1$. From this one can derive that the support vectors also satisfy

$$\mathbf{w} \cdot \mathbf{x}_i - b = 1 / y_i = y_i \iff b = \mathbf{w} \cdot \mathbf{x}_i - y_i,$$

which allows one to define the offset $b$. In practice, it is more robust to average over all $N_{SV}$ support vectors:

$$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} (\mathbf{w} \cdot \mathbf{x}_i - y_i).$$
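
As a small illustration of how the hyperplane is recovered once the multipliers are known, the sketch below computes w as the weighted sum of training vectors and b as the average of w·x_i − y_i over the support vectors. The three data points and the multiplier values are assumptions: they were worked out by hand for this toy problem rather than produced by a quadratic programming solver, and the function name `recover_hyperplane` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_hyperplane(alphas, xs, ys, tol=1e-12):
    """Return (w, b, support_vectors) for the hyperplane w.x - b = 0."""
    dim = len(xs[0])
    # w = sum_i alpha_i * y_i * x_i
    w = [sum(a * y * x[k] for a, x, y in zip(alphas, xs, ys)) for k in range(dim)]
    # Support vectors are the points whose multiplier is greater than zero.
    support = [(x, y) for a, x, y in zip(alphas, xs, ys) if a > tol]
    # Average b = w.x_i - y_i over all support vectors.
    b = sum(dot(w, x) - y for x, y in support) / len(support)
    return w, b, support

# Toy problem: the first two points lie on the margin, the third does not.
xs = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
ys = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]  # hand-derived multipliers for this toy problem

w, b, support = recover_hyperplane(alphas, xs, ys)
print(w, b, len(support))
```

The third point has a zero multiplier and therefore has no influence on w or b, which is the sense in which only the support vectors determine the classifier.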

### Dual form

Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is only a function of the support vectors, the training data that lie on the margin.

Using the fact that $\|\mathbf{w}\|^2 = \mathbf{w} \cdot \mathbf{w}$ and substituting $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$, one can show that the dual of the SVM reduces to the following optimization problem:

Maximize (in $\alpha_i$)

$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

subject to (for any $i = 1, \dots, n$)

$$\alpha_i \ge 0,$$

and to the constraint from the minimization in $b$

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

Here the kernel is defined by $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$.

$\mathbf{w}$ can be computed thanks to the $\alpha$ terms:

$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i.$$
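
A tiny numerical check can make the dual objective less abstract. The sketch below evaluates the sum of the multipliers minus half the double sum of α_i α_j y_i y_j k(x_i, x_j) on a two-point toy problem, for a few feasible multiplier vectors. The data and the candidate multipliers are assumptions chosen by hand, and the helper names are hypothetical; no optimizer is involved.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alphas, xs, ys, kernel=linear_kernel):
    """sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    n = len(xs)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad

def is_dual_feasible(alphas, ys, tol=1e-9):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    balanced = abs(sum(a * y for a, y in zip(alphas, ys))) <= tol
    return balanced and all(a >= 0 for a in alphas)

xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]

# Feasible candidates of the form (a, a); the objective is 2a - 2a^2.
for a in (0.25, 0.5, 1.0):
    print(a, is_dual_feasible([a, a], ys), dual_objective([a, a], xs, ys))
```

Among these candidates the objective is largest at a = 0.5, where it equals 0.5. That matches the primal value ½||w||² for w = (1, 0), the hyperplane obtained from those multipliers.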

### Biased and unbiased hyperplanes

For simplicity, it is sometimes required that the hyperplane pass through the origin of the coordinate system. Such hyperplanes are called unbiased, whereas general hyperplanes not necessarily passing through the origin are called biased. An unbiased hyperplane can be enforced by setting $b = 0$ in the primal optimization problem. The corresponding dual is identical to the dual given above without the equality constraint

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

## Soft margin

In 1995, Corinna Cortes and Vladimir Vapnik suggested a modified maximum margin idea that allows for mislabeled examples. If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables, $\xi_i$, which measure the degree of misclassification of the datum $\mathbf{x}_i$:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n. \qquad (2)$$

The objective function is then increased by a function which penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \right\}$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

The constraint in (2), along with the objective of minimizing $\|\mathbf{w}\|$, can be handled using Lagrange multipliers as done above. One then has to solve the following problem:

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$$

with $\alpha_i, \beta_i \ge 0$.
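
The role of the slack variables and of the constant C can be seen in a short sketch. For a fixed hyperplane w·x − b = 0, the smallest slack that satisfies y_i(w·x_i − b) ≥ 1 − ξ_i with ξ_i ≥ 0 is max(0, 1 − y_i(w·x_i − b)); the code uses that closed form and then evaluates the linear-penalty objective ½||w||² + C Σ ξ_i. The data, the hyperplane, and the two values of C are assumptions picked for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, data):
    """Smallest xi_i >= 0 with y_i * (w.x_i - b) >= 1 - xi_i, per point."""
    return [max(0.0, 1.0 - y * (dot(w, x) - b)) for x, y in data]

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum_i xi_i with a linear penalty on the slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))

# Hypothetical data: the third point sits inside the margin and the
# fourth lies on the wrong side of the hyperplane.
data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), 1), ((-1.0, 0.0), 1)]
w, b = (1.0, 0.0), 0.0

xi = slacks(w, b, data)
print(xi)
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

A slack between 0 and 1 marks a point inside the margin but still on the correct side, while a slack above 1 marks a misclassified point. Raising C makes the same violations more expensive relative to the margin term.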

### Dual form

Maximize (in $\alpha_i$)

$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

subject to (for any $i = 1, \dots, n$)

$$0 \le \alpha_i \le C,$$

and

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers. For the above formulation and its huge impact in practice, Cortes and Vapnik received the 2008 ACM Paris Kanellakis Award. Nonlinear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken, the problem becomes non-convex, and thus it is considerably more difficult to find a global solution.

## Nonlinear classification

The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; thus, though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions. Maximum margin classifiers are well regularized, so the infinite dimensions do not spoil the results. Some common kernels include:
• Polynomial (homogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d$
• Polynomial (inhomogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
• Gaussian radial basis function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, for $\gamma > 0$. Sometimes parametrized using $\gamma = 1 / (2\sigma^2)$
• Hyperbolic tangent: $k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j + c)$, for some (not every) $\kappa > 0$ and $c < 0$

The kernel is related to the transform $\varphi(\mathbf{x}_i)$ by the equation $k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. The value $\mathbf{w}$ is also in the transformed space, with $\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i)$. Dot products with $\mathbf{w}$ for classification can again be computed by the kernel trick, i.e. $\mathbf{w} \cdot \varphi(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})$. However, there does not in general exist a value $\mathbf{w}'$ such that $\mathbf{w} \cdot \varphi(\mathbf{x}) = k(\mathbf{w}', \mathbf{x})$.
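
The kernels listed above, and the way a kernel stands in for a dot product in the transformed space, can be sketched as follows. The default parameter values, the explicit quadratic feature map `phi_quadratic` for two-dimensional inputs, and the function names are assumptions made for this illustration; the decision function follows the w·x − b convention used in this article.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, d=2, c=0.0):
    """(u.v + c)^d; c = 0 is the homogeneous case, c = 1 the inhomogeneous one."""
    return (dot(u, v) + c) ** d

def rbf_kernel(u, v, gamma=0.5):
    """exp(-gamma * ||u - v||^2), with gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def tanh_kernel(u, v, kappa=1.0, c=-1.0):
    """tanh(kappa * u.v + c)."""
    return math.tanh(kappa * dot(u, v) + c)

def phi_quadratic(x):
    """Explicit feature map whose dot product equals (u.v)^2 for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def decision(x, alphas, xs, ys, b, kernel):
    """sum_i alpha_i y_i k(x_i, x) - b, computed without forming w."""
    return sum(a * y * kernel(p, x) for a, y, p in zip(alphas, ys, xs)) - b

u, v = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(u, v, d=2)
via_map = dot(phi_quadratic(u), phi_quadratic(v))
print(via_kernel, via_map)
```

The two printed numbers agree up to rounding: the kernel evaluates the dot product in the three-dimensional transformed space while only ever touching the original two-dimensional vectors.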

## Properties

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.

### Parameter selection

The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and soft margin parameter C.

A common choice is a Gaussian kernel, which has a single parameter γ. The best combination of C and γ is often selected by a grid search with exponentially growing sequences of C and γ, for example, $C \in \{2^{-5}, 2^{-3}, \dots, 2^{13}, 2^{15}\}$; $\gamma \in \{2^{-15}, 2^{-13}, \dots, 2^{1}, 2^{3}\}$. Typically, each combination of parameter choices is checked using cross-validation, and the parameters with the best cross-validation accuracy are picked. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.
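
A grid search of this kind might be organized as in the sketch below. The exponent ranges used for C and γ (powers of two from −5 to 15 and from −15 to 3, in steps of two) are an assumed example of exponentially growing sequences. The scoring function `stand_in_accuracy` is a hypothetical placeholder, not a real cross-validation run: in practice it would train an SVM on each fold and return the mean validation accuracy.

```python
import math
from itertools import product

def exp_grid(first_exp, last_exp, step=2, base=2.0):
    """Exponentially growing sequence base**first_exp, ..., base**last_exp."""
    return [base ** e for e in range(first_exp, last_exp + 1, step)]

def grid_search(Cs, gammas, cv_accuracy):
    """Return the (C, gamma) pair with the highest cross-validation accuracy."""
    return max(product(Cs, gammas), key=lambda pair: cv_accuracy(*pair))

def stand_in_accuracy(C, gamma):
    # Hypothetical placeholder for a cross-validation run; it is simply
    # a smooth function that peaks at C = 2 and gamma = 2**-3.
    return 1.0 / (1.0 + (math.log2(C) - 1) ** 2 + (math.log2(gamma) + 3) ** 2)

Cs = exp_grid(-5, 15)       # 2^-5, 2^-3, ..., 2^15
gammas = exp_grid(-15, 3)   # 2^-15, 2^-13, ..., 2^3

best_C, best_gamma = grid_search(Cs, gammas, stand_in_accuracy)
print(len(Cs) * len(gammas), best_C, best_gamma)
```

Once the best pair is found, the final model would be retrained on the whole training set with those values, as described above.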

### Issues

Potential drawbacks of the SVM are the following three aspects:
• Uncalibrated class membership probabilities
• The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section.
• Parameters of a solved model are difficult to interpret.

### Multiclass SVM

Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.

The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems. Common methods for such reduction include:
• Building binary classifiers which distinguish either (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances in the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification.
• DAGSVM
• error-correcting output codes
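
The two voting strategies for the one-versus-all and one-versus-one reductions can be sketched in a few lines. The binary classifiers here are hypothetical stand-ins based on distance to a class center on the real line, not trained SVMs; only the way their outputs are combined is the point of the example. Ties in the vote are not handled specially.

```python
from collections import Counter
from itertools import combinations

def one_vs_all_predict(x, scorers):
    """Winner-takes-all: the label whose scorer gives the highest output."""
    return max(scorers, key=lambda label: scorers[label](x))

def one_vs_one_predict(x, pair_classifiers):
    """Max-wins voting: each pairwise classifier casts one vote."""
    votes = Counter(clf(x) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional classes, each described by a center.
centers = {"low": 0.0, "mid": 5.0, "high": 10.0}

def make_scorer(center):
    # Stand-in for a comparable (calibrated) one-versus-all output.
    return lambda x: -abs(x - center)

def make_pair_classifier(a, b):
    # Stand-in for a binary classifier trained on classes a and b only.
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

scorers = {label: make_scorer(c) for label, c in centers.items()}
pair_classifiers = {(a, b): make_pair_classifier(a, b)
                    for a, b in combinations(centers, 2)}

print(one_vs_all_predict(4.2, scorers), one_vs_one_predict(4.2, pair_classifiers))
```

With three classes, one-versus-all needs three classifiers and one-versus-one needs one per pair, which is also three here but grows quadratically with the number of classes.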

Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.

### Transductive support vector machines

Transductive support vector machines extend SVMs in that they can also treat partially labeled data in semi-supervised learning by following the principles of transduction. Here, in addition to the training set $\mathcal{D}$, the learner is also given a set

$$\mathcal{D}^\star = \left\{ \mathbf{x}^\star_j \mid \mathbf{x}^\star_j \in \mathbb{R}^p \right\}_{j=1}^{k}$$

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:

Minimize (in $\mathbf{w}, b, \mathbf{y}^\star$)

$$\frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \dots, n$ and any $j = 1, \dots, k$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1,$$

$$y^\star_j(\mathbf{w} \cdot \mathbf{x}^\star_j - b) \ge 1,$$

and

$$y^\star_j \in \{-1, 1\}.$$

Transductive support vector machines were introduced by Vladimir Vapnik in 1998.

### Structured SVM

SVMs have been generalized to structured SVMs, where the label space is structured and of possibly infinite size.

### Regression

A version of SVM for regression was proposed in 1996 by Vladimir Vapnik, Harris Drucker, Chris Burges, Linda Kaufman and Alex Smola. This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction (within a threshold $\varepsilon$). Another SVM version, known as the least squares support vector machine (LS-SVM), has been proposed by Suykens and Vandewalle.
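
The statement that SVR ignores training data close to the model prediction corresponds to a cost that is zero inside a tube of half-width ε around the prediction. The sketch below uses the common ε-insensitive form max(0, |y − f(x)| − ε); that specific form, the value of ε, and the target and prediction values are assumptions made for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tube |y_true - y_pred| <= epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical targets and model predictions.
targets = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 2.0]

losses = [epsilon_insensitive_loss(t, p) for t, p in zip(targets, predictions)]

# Only points with non-zero loss can influence the fitted model.
influential = [i for i, loss in enumerate(losses) if loss > 0]
print(losses, influential)
```

The first point falls inside the tube and contributes nothing, so moving it slightly would not change the cost; the other two lie outside the tube and are penalized in proportion to how far they are from it.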

## Implementation

The parameters of the maximum-margin hyperplane are derived by solving the optimization problem. Several specialized algorithms exist for quickly solving the quadratic programming (QP) problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks.
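
For orientation, the QP that such solvers usually target is the soft-margin dual, written here in commonly used notation. The symbols are assumptions of this sketch and may differ from those used elsewhere in the article: multipliers α_i, labels y_i ∈ {−1, +1}, kernel k, penalty parameter C and n training points.

$$
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
$$

The objective is quadratic in α and the constraints are linear, which is what makes this a QP. It has one variable per training point and a dense n × n kernel matrix, which is why decomposition into smaller chunks pays off.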

A common method is Platt's Sequential Minimal Optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that may be solved analytically, eliminating the need for a numerical optimization algorithm.
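
The two-variable analytic update can be sketched as follows. This is a simplified teaching variant, not Platt's full algorithm: it picks the second multiplier at random instead of using Platt's selection heuristics, and it assumes a linear kernel. All names (`simplified_smo`, `decision_value`, the parameters and the toy data) are hypothetical.

```python
import random


def linear_kernel(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, max_sweeps=1000, seed=0):
    """Train a soft-margin linear SVM by optimizing two multipliers at a time."""
    rng = random.Random(seed)
    n = len(X)
    K = [[linear_kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    b = 0.0

    def f(i):
        # Decision value for training point i under the current multipliers.
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = 0
    for _ in range(max_sweeps):  # hard cap so the sketch always terminates
        if passes >= max_passes:
            break
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            r = y[i] * E_i
            # Skip points that already satisfy the KKT conditions within tol.
            if not ((r < -tol and alpha[i] < C) or (r > tol and alpha[i] > 0)):
                continue
            j = rng.choice([k for k in range(n) if k != i])
            E_j = f(j) - y[j]
            a_i_old, a_j_old = alpha[i], alpha[j]
            # Box constraints plus sum(alpha * y) = 0 confine alpha_j to [L, H].
            if y[i] != y[j]:
                L, H = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
            else:
                L, H = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if L == H or eta >= 0:
                continue
            # Closed-form optimum of the two-variable sub-problem, then clip.
            a_j = min(H, max(L, a_j_old - y[j] * (E_i - E_j) / eta))
            if abs(a_j - a_j_old) < 1e-5:
                continue
            a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)
            alpha[i], alpha[j] = a_i, a_j
            # Update the bias so the KKT conditions hold for the changed pair.
            b1 = b - E_i - y[i] * (a_i - a_i_old) * K[i][i] - y[j] * (a_j - a_j_old) * K[i][j]
            b2 = b - E_j - y[i] * (a_i - a_i_old) * K[i][j] - y[j] * (a_j - a_j_old) * K[j][j]
            if 0 < a_i < C:
                b = b1
            elif 0 < a_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b


def decision_value(x, X, y, alpha, b):
    """Signed score of a point; its sign is the predicted class."""
    return sum(a * yi * linear_kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


# Hypothetical toy data: two linearly separable clusters.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
alpha, b = simplified_smo(X, y)
predictions = [1 if decision_value(x, X, y, alpha, b) > 0 else -1 for x in X]
```

Each inner step changes exactly two multipliers along the line fixed by the equality constraint, so the update needs only a division and a clip rather than a general-purpose numerical optimizer.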

Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems. Instead of solving a sequence of broken-down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick.
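
One way to see why a low-rank approximation helps is the sketch below. It rests on two assumptions that the text does not state: the approximation is built with the Nyström method (one of several possible choices), and the Newton step reduces to a system of the form (D + K) x = r with D a positive diagonal matrix. The function names, the RBF kernel and the random data are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_factor(X, m, gamma=0.5, seed=0):
    """Return V (n x m) such that V @ V.T approximates the full kernel matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)  # random landmark points
    K_nm = rbf_kernel(X, X[idx], gamma)              # n x m block
    K_mm = K_nm[idx]                                 # m x m landmark block
    w, U = np.linalg.eigh(K_mm)
    w = np.maximum(w, 1e-10)                         # guard tiny eigenvalues
    return K_nm @ (U / np.sqrt(w))                   # K_nm @ K_mm^(-1/2)


def solve_low_rank(V, d, rhs):
    """Solve (diag(d) + V V^T) x = rhs via the Woodbury identity.

    Only an m x m system is factorized, so the cost is O(n m^2)
    instead of O(n^3) for the dense n x n system.
    """
    Dinv_rhs = rhs / d
    Dinv_V = V / d[:, None]
    small = np.eye(V.shape[1]) + V.T @ Dinv_V
    return Dinv_rhs - Dinv_V @ np.linalg.solve(small, V.T @ Dinv_rhs)


# Hypothetical data: 200 points in the plane, 20 landmarks.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
V = nystrom_factor(X, m=20)
d = np.full(200, 0.1)
rhs = rng.normal(size=200)
x = solve_low_rank(V, d, rhs)

# Residual against the approximated system, formed densely only as a check.
residual = np.linalg.norm((np.diag(d) + V @ V.T) @ x - rhs)
```

The solution is exact for the approximated matrix V Vᵀ, not for the true kernel matrix, so its quality depends on how well the chosen rank captures the kernel.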

## See also

• In situ adaptive tabulation
• Kernel machines
• Predictive analytics
• Relevance vector machine, a probabilistic sparse kernel model identical in functional form to SVM
• Sequential minimal optimization

## External links

• A Tutorial on Support Vector Machines for Pattern Recognition by Christopher J. C. Burges. Data Mining and Knowledge Discovery 2:121–167, 1998
• www.kernel-machines.org (general information and collection of research papers)
• www.support-vector-machines.org (literature, review, software and links related to support vector machines; academic site)
• videolectures.net (SVM-related video lectures)
• Animation clip: SVM with polynomial kernel visualization
• A very basic SVM tutorial for complete beginners by Tristan Fletcher: http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
• www.shogun-toolbox.org: Shogun (toolbox) contains about 20 different implementations of SVMs
• libsvm: a library of SVMs which is actively patched
• liblinear: a library for large linear classification including some SVMs
• flssvm: a least squares SVM implementation written in Fortran
• Shark: a C++ machine learning library implementing various types of SVMs
• dlib: a C++ library for working with kernel methods and SVMs
• SVM light: a collection of open-source software tools for learning and classification using SVM