Ensemble Averaging
In machine learning, particularly in the creation of artificial neural networks, ensemble averaging is the process of creating multiple models and combining them to produce a desired output, as opposed to creating just one model. Frequently an ensemble of models performs better than any individual model, because the various errors of the models "average out."

Overview

Ensemble averaging is one of the simplest types of committee machines. Along with boosting, it is one of the two major types of static committee machines. In contrast to standard network design, in which many networks are generated but only one is kept, ensemble averaging keeps the less satisfactory networks around, but with less weight. The theory of ensemble averaging relies on two properties of artificial neural networks:
  1. In any network, the bias can be reduced at the cost of increased variance.
  2. In a group of networks, the variance can be reduced at no cost to bias (a short justification follows this list).
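
The second point can be justified by a standard argument, stated here under the idealized assumption (made only for illustration) that the experts share a common bias and have uncorrelated errors. If each expert's error decomposes as $y_i - f = b + \epsilon_i$, where $b$ is a bias common to all $N$ experts and the $\epsilon_i$ are zero-mean, uncorrelated, and of variance $\sigma^2$, then the average $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ has the same bias $b$, while

$$\operatorname{Var}(\bar{y}) = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(\epsilon_i) = \frac{\sigma^2}{N},$$

so averaging reduces the variance without increasing the bias.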


Ensemble averaging creates a group of networks, each with low bias and high variance, then combines them into a new network with (hopefully) low bias and low variance. It is thus a resolution of the bias/variance dilemma. The idea of combining experts has been traced back to Pierre-Simon Laplace.

Method

The theory mentioned above gives an obvious strategy: create a set of experts with low bias and high variance, and then average them. Generally, this means creating a set of experts with varying parameters; frequently, these are the initial synaptic weights, although other factors (such as the learning rate, momentum, etc.) may be varied as well. Some authors recommend against varying weight decay and early stopping. The steps are therefore as follows (a minimal code sketch of the procedure appears after the list):
  1. Generate N experts, each with their own initial values. (Initial values are usually chosen randomly from a distribution.)
  2. Train each expert separately.
  3. Combine the experts and average their values.
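
The following sketch carries out these three steps on a toy regression problem. It is only an illustration: the use of scikit-learn's MLPRegressor, the dataset, the network size, and the number of experts are all arbitrary choices made for the example, not part of the method itself.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  # Toy 1-D regression data (illustrative only).
  rng = np.random.default_rng(0)
  X = rng.uniform(-3.0, 3.0, size=(200, 1))
  y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

  # Steps 1 and 2: generate N experts, each with its own random initial
  # weights (via random_state), and train each expert separately.
  N = 10
  experts = [
      MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=i).fit(X, y)
      for i in range(N)
  ]

  # Step 3: combine the experts by averaging their outputs.
  def ensemble_predict(X_new):
      return np.mean([expert.predict(X_new) for expert in experts], axis=0)

  print(ensemble_predict(np.array([[0.5]])))

Because only the random initialization differs between the experts, each one typically converges to a different local minimum; averaging their outputs is the "raw average" discussed below.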

Alternatively, domain knowledge may be used to generate several classes of experts. An expert from each class is trained, and then combined.

A more complex version of ensemble averaging views the final result not as a mere average of all the experts, but rather as a weighted sum. If each expert is $y_j$, then the overall result $\tilde{y}$ can be defined as:

$$\tilde{y}(\mathbf{x}; \boldsymbol{\alpha}) = \sum_{j=1}^{p} \alpha_j y_j(\mathbf{x})$$

where $\boldsymbol{\alpha} = \{\alpha_1, \dots, \alpha_p\}$ is a set of weights. The optimization problem of finding $\boldsymbol{\alpha}$ is readily solved through neural networks, hence a "meta-network" in which each "neuron" is in fact an entire neural network can be trained, and the synaptic weights of the final network are the weights applied to each expert. This is known as a linear combination of experts.

It can be seen that most forms of neural networks are some subset of a linear combination: the standard neural net (where only one expert is used) is simply a linear combination with all $\alpha_j = 0$ and one $\alpha_k = 1$. A raw average is where all $\alpha_j$ are equal to some constant value, namely one over the total number of experts: $\alpha_j = 1/p$.
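
One simple way to obtain such weights, continuing the sketch from the Method section, is to fit the linear combination on a held-out set by least squares; this plays the role of a single-layer "meta-network" trained on top of the frozen experts. The held-out set and the least-squares solver are illustrative assumptions for the example, not prescriptions from the article.

  # Continuing the earlier sketch: `experts` and `rng` are defined above.
  X_val = rng.uniform(-3.0, 3.0, size=(100, 1))
  y_val = np.sin(X_val).ravel() + rng.normal(scale=0.1, size=100)

  # Each column holds one expert's predictions on the held-out inputs.
  P = np.column_stack([expert.predict(X_val) for expert in experts])

  # Least-squares fit of the weights alpha: effectively a single linear
  # "meta-neuron" with a squared-error loss on top of the frozen experts.
  alpha, *_ = np.linalg.lstsq(P, y_val, rcond=None)

  def weighted_predict(X_new):
      P_new = np.column_stack([expert.predict(X_new) for expert in experts])
      return P_new @ alpha

Setting every entry of alpha to one over the number of experts recovers the raw average above, while a one-hot alpha recovers a single expert.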

A more recent ensemble averaging method is negative correlation learning, proposed by Y. Liu and X. Yao. This method has since been widely used in evolutionary computing.
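
In outline, negative correlation learning trains each network on its own squared error plus a penalty term that encourages its error to be negatively correlated with the errors of the rest of the ensemble. The notation below follows the way the method is commonly presented and is not taken from this article:

$$E_i(n) = \frac{1}{2}\big(F_i(n) - d(n)\big)^2 + \lambda\, p_i(n), \qquad p_i(n) = \big(F_i(n) - \bar{F}(n)\big)\sum_{j \neq i}\big(F_j(n) - \bar{F}(n)\big),$$

where $F_i(n)$ is the output of the $i$-th network on training example $n$, $\bar{F}(n)$ is the ensemble (average) output, $d(n)$ is the target, and $\lambda$ controls the strength of the penalty.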

Benefits

  • The resulting committee is almost always less complex than a single network which would achieve the same level of performance
  • The resulting committee can be trained more easily on smaller input sets
  • The resulting committee often has improved performance over any single network
  • The risk of overfitting is lessened, as there are fewer parameters (weights) which need to be set
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 