Soft independent modelling of class analogies
Encyclopedia
Soft independent modelling by class analogy (SIMCA) is a statistical
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

 method for supervised classification
Supervised learning
Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value...

 of data. The method requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact the classifier can identify samples as belonging to multiple classes and not necessarily producing a classification of samples into non-overlapping classes.

Method

In order to build the classification models, the samples belonging to each class need to be analysed using principal components analysis
Principal components analysis
Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to...

 (PCA); only the significant components are retained.

For a given class, the resulting model then describes either a line (for one Principal Component or PC), plane (for two PCs) or hyper-plane (for more than two PCs). For each modelled class, the mean orthogonal distance of training data samples from the line, plane or hyper-plane (calculated as the residual standard deviation) is used to determine a critical distance for classification. This critical distance is based on the F-distribution and is usually calculated using 95% or 99% confidence intervals.

New observations are projected into each PC model and the residual distances calculated. An observation is assigned to the model class when its residual distance from the model is below the statistical limit for the class. The observation may be found to belong to multiple classes and a measure of goodness of the model
Goodness of fit
The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g...

 can be found from the number of cases where the observations are classified into multiple classes. The classification efficiency is usually indicated by Receiver operating characteristic
Receiver operating characteristic
In signal detection theory, a receiver operating characteristic , or simply ROC curve, is a graphical plot of the sensitivity, or true positive rate, vs. false positive rate , for a binary classifier system as its discrimination threshold is varied...

s.

In the original SIMCA method, the ends of the hyper-plane of each class are closed off by setting statistical control limits along the retained principal components axes (i.e. range: minimum score value minus 0.5 times score standard deviation to maximum score value plus 0.5 times standard deviation).

More recent adaptations of the SIMCA method close off the hyper-plane by construction of ellipsoids (e.g. Hotelling's T2
Hotelling's T-square distribution
In statistics Hotelling's T-squared distribution is important because it arises as the distribution of a set of statistics which are natural generalisations of the statistics underlying Student's t distribution...

 or Mahalanobis distance
Mahalanobis distance
In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analyzed. It gauges similarity of an unknown sample set to a known one. It differs from Euclidean...

). With such modified SIMCA methods, classification of an object requires both that its orthogonal distance from the model and its projection within the model (i.e. score value within region defined by ellipsoid) are not significant.

Application

SIMCA as a method of classification has gained widespread use especially in applied statistical fields such as chemometrics
Chemometrics
Chemometrics is the science of extracting information from chemical systems by data-driven means. It is a highly interfacial discipline, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to...

and spectroscopic data analysis.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK