Feature selection

In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models. When applied in the biology domain, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray experiments. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:
  • Alleviating the effect of the curse of dimensionality.
  • Enhancing generalization capability.
  • Speeding up the learning process.
  • Improving model interpretability.

Feature selection also helps people acquire a better understanding of their data by indicating which features are important and how they relate to one another.

Introduction

Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features instead of an optimal set.

Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset.
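
For illustration, here is a minimal sketch of the feature-ranking approach, scoring each feature independently and keeping those above a cutoff. The use of scikit-learn's ANOVA F-score, the example dataset and the threshold value are assumptions for the example, not methods prescribed above:

```python
# Feature ranking: score each feature independently against the class
# labels, then keep only the features above a (user-chosen) threshold.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
f_scores, p_values = f_classif(X, y)    # one F-score per feature

threshold = 10.0                        # illustrative cutoff only
selected = np.where(f_scores > threshold)[0]
print("F-scores:", np.round(f_scores, 1))
print("Selected feature indices:", selected)
```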

In statistics, the most popular form of feature selection is stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation; in statistics, some criterion is optimized, which leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.

Subset selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken into wrappers, filters and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and risk overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter measure is evaluated. Embedded techniques are embedded in, and specific to, a model.
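
As an illustration of the wrapper approach, here is a minimal sketch of greedy forward selection scored by cross-validation; the classifier, dataset and CV settings are assumptions for the example:

```python
# Wrapper sketch: greedy forward selection, scoring each candidate subset
# by cross-validating a model trained on just those columns.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Score every one-feature extension of the current subset.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:    # stop when no extension improves the CV score
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("Selected features:", selected, "CV accuracy:", round(best_score, 3))
```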

Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates whether the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor- (or operator-) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, or a program's maximum allowed run time has been surpassed.

Alternative search-based techniques are based on targeted projection pursuit, which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.

Search approaches include:
  • Exhaustive
  • Best first
  • Simulated annealing
  • Genetic algorithm
  • Greedy forward selection
  • Greedy backward elimination
  • Targeted projection pursuit

Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category.
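
For illustration, a minimal sketch of the mutual-information filter score, estimated with scikit-learn; the dataset and the choice to keep the top five features are assumptions for the example:

```python
# Filter sketch: score each feature by its mutual information with the
# class, then keep the k highest-scoring features (k is arbitrary here).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)  # one score per feature

k = 5
top_k = np.argsort(mi)[::-1][:k]
print("MI scores:", np.round(mi, 3))
print("Top-%d features:" % k, top_k)
```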

Other available filter metrics include:
  • Class separability
    • Error probability
    • Inter-class distance
    • Probabilistic distance
    • Entropy
  • Consistency-based feature selection
  • Correlation-based feature selection

Optimality criteria

There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are the Mallows' $C_p$ statistic and the Akaike information criterion (AIC). These add variables if the t-statistic is bigger than $\sqrt{2}$.
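
To see where the $\sqrt{2}$ threshold comes from (a short derivation not in the original, assuming large samples and nested models): adding one variable increases the AIC penalty $2k$ by $2$, while the deviance $-2\ln\hat{L}$ drops by approximately the squared t-statistic of the new variable, so

$$ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \Delta\mathrm{AIC} \approx 2 - t^2 < 0 \iff |t| > \sqrt{2}. $$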

Other criteria are the Bayesian information criterion (BIC), which uses $\sqrt{\log n}$; minimum description length (MDL), which asymptotically uses $\sqrt{\log n}$; Bonferroni / RIC, which use $\sqrt{2 \log p}$; maximum dependency feature selection; and a variety of new criteria that are motivated by the false discovery rate (FDR), which use something close to $\sqrt{2 \log \tfrac{p}{q}}$.

Minimum-redundancy-maximum-relevance (mRMR) feature selection

Peng et al. proposed an mRMR feature-selection method that can use mutual information, correlation, or distance/similarity scores to select features. For example, with mutual information, relevant features and redundant features are considered simultaneously. The relevance of a feature set $S$ for the class $c$ is defined by the average value of all mutual-information values between the individual features $f_i$ and the class $c$:

$$ D(S, c) = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c). $$

The redundancy of all features in the set $S$ is the average value of all mutual-information values between the features $f_i$ and $f_j$:

$$ R(S) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j). $$

The mRMR criterion is a combination of the two measures given above and is defined as follows:

$$ \mathrm{mRMR} = \max_{S} \left[ \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c) - \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j) \right]. $$

Suppose that there are $n$ full-set features. Let $x_i$ be the binary membership indicator for the feature $f_i$, so that $x_i = 1$ indicates the appearance and $x_i = 0$ the absence of the feature $f_i$ in the globally optimal feature set. Denote the mutual-information values $I(f_i; c)$ and $I(f_i; f_j)$ by the constants $c_i$ and $a_{ij}$, respectively. The mRMR criterion above can then be described as the optimization problem:

$$ \max_{x \in \{0,1\}^n} \left[ \frac{\sum_{i=1}^n c_i x_i}{\sum_{i=1}^n x_i} - \frac{\sum_{i,j=1}^n a_{ij} x_i x_j}{\left( \sum_{i=1}^n x_i \right)^2} \right]. $$

Peng et al. show that mRMR feature selection is an approximation of the theoretically optimal maximum-dependency feature selection, which maximizes the mutual information between the joint distribution of the selected features and the classification variable. Because mRMR recasts this combinatorial problem as a series of much smaller problems, each involving only two variables, it relies on pairwise joint probabilities, which can be estimated much more robustly. Overall, the algorithm is much more efficient than the theoretically optimal max-dependency selection, while remaining robust at finding useful features.

mRMR is also related to the correlation-based feature selection described below, and it may be seen as a special case of some generic feature selectors.
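
For illustration, a minimal sketch of the common greedy (incremental) variant of mRMR, assuming mutual information as the score; the scikit-learn estimators, dataset and subset size are assumptions for the example:

```python
# Greedy mRMR sketch: start from the most relevant feature, then repeatedly
# add the feature with the best relevance-minus-mean-redundancy score.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_wine(return_X_y=True)
n_features, k = X.shape[1], 5

relevance = mutual_info_classif(X, y, random_state=0)   # c_i = I(f_i; c)
redundancy = np.zeros((n_features, n_features))         # a_ij = I(f_i; f_j)
for i in range(n_features):
    redundancy[i] = mutual_info_regression(X, X[:, i], random_state=0)

selected = [int(np.argmax(relevance))]   # begin with the most relevant feature
while len(selected) < k:
    candidates = [i for i in range(n_features) if i not in selected]
    # Relevance minus average redundancy with the already-chosen features.
    scores = [relevance[i] - redundancy[i, selected].mean() for i in candidates]
    selected.append(candidates[int(np.argmax(scores))])

print("mRMR-selected features:", selected)
```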

Correlation feature selection

The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other". The following equation gives the merit of a feature subset $S$ consisting of $k$ features:

$$ \mathrm{Merit}_{S_k} = \frac{k\, \overline{r_{cf}}}{\sqrt{k + k(k-1)\, \overline{r_{ff}}}}. $$

Here, $\overline{r_{cf}}$ is the average value of all feature-classification correlations, and $\overline{r_{ff}}$ is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

$$ \mathrm{CFS} = \max_{S_k} \left[ \frac{r_{c f_1} + r_{c f_2} + \cdots + r_{c f_k}}{\sqrt{k + 2\left( r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_1} \right)}} \right]. $$

By using binary indicator variables $x_i$, as in the mRMR case, to indicate the appearance or absence of the feature $f_i$, and writing $a_i$ for the feature-classification correlations and $b_{ij}$ for the feature-feature correlations, the CFS criterion can be rewritten as the optimization problem:

$$ \max_{x \in \{0,1\}^n} \left[ \frac{\left( \sum_{i=1}^n a_i x_i \right)^2}{\sum_{i=1}^n x_i + \sum_{i \neq j} 2 b_{ij} x_i x_j} \right]. $$

The combinatorial optimization problems above are, in fact, mixed 0-1 linear programming problems that can be solved by using branch-and-bound algorithms.
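
For illustration, a minimal sketch of the CFS merit score, assuming absolute Pearson correlation as the measure of both feature-classification and feature-feature correlation; the dataset and the example subset are assumptions:

```python
# CFS sketch: merit = k * mean|r_cf| / sqrt(k + k(k-1) * mean|r_ff|).
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset, using absolute Pearson correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(subset) for j in subset[a + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

print("Merit of subset [0, 6, 9]:", round(cfs_merit(X, y, [0, 6, 9]), 3))
```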

Embedded methods incorporating feature selection

  • Random multinomial logit (RMNL)
  • Sparse regression, LASSO (see the sketch after this list)
  • Regularized trees, e.g. regularized random forest implemented in the RRF package
  • Decision tree
  • Memetic algorithm
  • Auto-encoding networks with a bottleneck layer
  • Many other machine learning methods applying a pruning step.
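
As an illustration of an embedded method, a minimal LASSO sketch in which the features with non-zero coefficients are the selected ones; the dataset and regularization strength are assumptions for the example:

```python
# Embedded sketch: L1-regularized (LASSO) regression zeroes out the
# coefficients of unhelpful features, performing selection during fitting.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # L1 penalties are scale-sensitive

lasso = Lasso(alpha=0.5).fit(X, y)
selected = np.where(lasso.coef_ != 0)[0]
print("Selected (non-zero) features:", selected)
```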

Software for feature selection

Many standard data analysis software systems are often used for feature selection, such as Scilab, NumPy and the R language. Other software systems are tailored specifically to the feature-selection task:
  • Weka – freely available, open-source software in Java.
  • Feature Selection Toolbox 3 – freely available, open-source software in C++.
  • RapidMiner – freely available, open-source software.
  • Orange – freely available, open-source software (module orngFSS).
  • TOOLDIAG Pattern recognition toolbox – freely available C toolbox.
  • minimum redundancy feature selection tool – freely available C/Matlab code for selecting minimum-redundancy features.
  • A C# implementation of greedy forward feature subset selection for various classifiers (e.g., LibLinear, SVM-light).
  • MCFS-ID (Monte Carlo Feature Selection and Interdependency Discovery) – a Monte Carlo method-based tool for feature selection that also allows for the discovery of interdependencies between the relevant features. It is particularly suitable for the analysis of high-dimensional, ill-defined transactional and biological data.
  • RRF – an R package for feature selection, installable from within R. RRF stands for Regularized Random Forest, a type of regularized trees. By building a regularized random forest, a compact set of non-redundant features can be selected without loss of predictive information. Regularized trees can capture non-linear interactions between variables and naturally handle variables of different scales, both numerical and categorical.

See also

  • Cluster analysis
  • Dimensionality reduction
  • Feature extraction
  • Data mining
