An
artificial neural network (ANN), usually called
neural networkThe term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes...
(NN), is a
mathematical modelA mathematical model is a description of a system using mathematical concepts and language. The process of developing a mathematical model is termed mathematical modeling. Mathematical models are used not only in the natural sciences and engineering disciplines A mathematical model is a...
or
computational modelA computational model is a mathematical model in computational science that requires extensive computational resources to study the behavior of a complex system by computer simulation. The system under study is often a complex nonlinear system for which simple, intuitive analytical solutions are...
that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of
artificial neuronAn artificial neuron is a mathematical function conceived as a crude model, or abstraction of biological neurons. Artificial neurons are the constitutive units in an artificial neural network...
s, and it processes information using a
connectionistConnectionism is a set of approaches in the fields of artificial intelligence, cognitive psychology, cognitive science, neuroscience and philosophy of mind, that models mental or behavioral phenomena as the emergent processes of interconnected networks of simple units...
approach to
computationComputation is defined as any type of calculation. Also defined as use of computer technology in Information processing.Computation is a process following a well-defined model understood and expressed in an algorithm, protocol, network topology, etc...
. In most cases an ANN is an
adaptive systemThe term adaptation arises mainly in the biological scope as a trial to study the relationship between the characteristics of living beings and their environments...
that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical
data modelingData modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.- Overview :...
tools. They are usually used to model complex relationships between inputs and outputs or to
find patternsIn machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...
in data.
Background
The original inspiration for the term
Artificial Neural Network came from examination of
central nervous systemThe central nervous system is the part of the nervous system that integrates the information that it receives from, and coordinates the activity of, all parts of the bodies of bilaterian animals—that is, all multicellular animals except sponges and radially symmetric animals such as jellyfish...
s and their
neuronA neuron is an electrically excitable cell that processes and transmits information by electrical and chemical signaling. Chemical signaling occurs via synapses, specialized connections with other cells. Neurons connect to each other to form networks. Neurons are the core components of the nervous...
s, axons, dendrites, and synapses, which constitute the processing elements of biological neural networks investigated by
neuroscienceNeuroscience is the scientific study of the nervous system. Traditionally, neuroscience has been seen as a branch of biology. However, it is currently an interdisciplinary science that collaborates with other fields such as chemistry, computer science, engineering, linguistics, mathematics,...
. In an artificial neural network, simple artificial nodes, variously called "
neuronsAn artificial neuron is a mathematical function conceived as a crude model, or abstraction of biological neurons. Artificial neurons are the constitutive units in an artificial neural network...
", "neurodes", "processing elements" (PEs) or "units", are connected together to form a network of nodes mimicking the biological neural networks — hence the term "artificial neural network".
Because neuroscience is still full of unanswered questions, and since there are many levels of abstraction and therefore many ways to take inspiration from the brain, there is no single formal definition of what an artificial neural network is. Generally, it involves a network of simple processing elements that exhibit complex global behavior determined by connections between processing elements and element parameters. While an artificial neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.
These networks are also similar to the biological neural networks in the sense that functions are performed collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which various units are assigned (see also
connectionismConnectionism is a set of approaches in the fields of artificial intelligence, cognitive psychology, cognitive science, neuroscience and philosophy of mind, that models mental or behavioral phenomena as the emergent processes of interconnected networks of simple units...
). Currently, the term Artificial Neural Network (ANN) tends to refer mostly to neural network models employed in
statisticsStatistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....
,
cognitive psychologyCognitive psychology is a subdiscipline of psychology exploring internal mental processes.It is the study of how people perceive, remember, think, speak, and solve problems.Cognitive psychology differs from previous psychological approaches in two key ways....
and
artificial intelligenceArtificial intelligence is the intelligence of machines and the branch of computer science that aims to create it. AI textbooks define the field as "the study and design of intelligent agents" where an intelligent agent is a system that perceives its environment and takes actions that maximize its...
.
Neural networkThe term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes...
models designed with emulation of the
central nervous systemThe central nervous system is the part of the nervous system that integrates the information that it receives from, and coordinates the activity of, all parts of the bodies of bilaterian animals—that is, all multicellular animals except sponges and radially symmetric animals such as jellyfish...
(CNS) in mind are a subject of theoretical neuroscience and
computational neuroscienceComputational neuroscience is the study of brain function in terms of the information processing properties of the structures that make up the nervous system...
.
In modern
software implementationsNeural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and in some cases a wider array of adaptive systems.-Simulators:...
of artificial neural networks, the approach inspired by biology has been largely abandoned for a more practical approach based on statistics and signal processing. In some of these systems, neural networks or parts of neural networks (such as
artificial neuronAn artificial neuron is a mathematical function conceived as a crude model, or abstraction of biological neurons. Artificial neurons are the constitutive units in an artificial neural network...
s) are used as components in larger systems that combine both adaptive and non-adaptive elements. While the more general approach of such adaptive systems is more suitable for real-world problem solving, it has far less to do with the traditional artificial intelligence connectionist models. What they do have in common, however, is the principle of non-linear, distributed, parallel and local processing and adaptation.
Models
Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function

or a distribution over

or both

and

, but sometimes models are also intimately associated with a particular learning algorithm or learning rule. A common use of the phrase ANN model really means the definition of a
class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity).
Network function
The word
network in the term 'artificial neural network' refers to the inter–connections between the neurons in the different layers of each system. An example system has three layers. The first layer has input neurons, which send data via synapses to the second layer of neurons, and then via more synapses to the third layer of output neurons. More complex systems will have more layers of neurons with some having increased layers of input neurons and output neurons. The synapses store parameters called "weights" that manipulate the data in the calculations.
An ANN is typically defined by three types of parameters:
- The interconnection pattern between different layers of neurons
- The learning process for updating the weights of the interconnections
- The activation function that converts a neuron's weighted input to its output activation.
Mathematically, a neuron's network function

is defined as a composition of other functions

, which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the
nonlinear weighted sum, where

, where

(commonly referred to as the
activation functionIn computational networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" or "OFF" , depending on input. This is similar to the behavior of...
) is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions

as simply a vector

.
This figure depicts such a decomposition of

, with dependencies between variables indicated by arrows. These can be interpreted in two ways.
The first view is the functional view: the input

is transformed into a 3-dimensional vector

, which is then transformed into a 2-dimensional vector

, which is finally transformed into

. This view is most commonly encountered in the context of optimization.
The second view is the probabilistic view: the
random variableIn probability and statistics, a random variable or stochastic variable is, roughly speaking, a variable whose value results from a measurement on some type of random process. Formally, it is a function from a probability space, typically to the real numbers, which is measurable functionmeasurable...

depends upon the random variable

, which depends upon

, which depends upon the random variable

. This view is most commonly encountered in the context of graphical models.
The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of

are independent of each other given their input

). This naturally enables a degree of parallelism in the implementation.
Networks such as the previous one are commonly called feedforward, because their graph is a
directed acyclic graphIn mathematics and computer science, a directed acyclic graph , is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of...
. Networks with
cyclesIn graph theory, a path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the next vertex in the sequence. A path may be infinite, but a finite path always has a first vertex, called its start vertex, and a last vertex, called its end vertex. Both of them...
are commonly called
recurrentA recurrent neural network is a class of neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process...
. Such networks are commonly depicted in the manner shown at the top of the figure, where

is shown as being dependent upon itself. However, an implied temporal dependence is not shown.
Learning
What has attracted the most interest in neural networks is the possibility of
learning. Given a specific
task to solve, and a
class of functions

, learning means using a set of
observations to find

which solves the task in some
optimal sense.
This entails defining a cost function

such that, for the optimal solution

,

(i.e., no solution has a cost less than the cost of the optimal solution).
The cost function

is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a
function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a
statisticA statistic is a single measure of some attribute of a sample . It is calculated by applying a function to the values of the items comprising the sample which are known together as a set of data.More formally, statistical theory defines a statistic as a function of a sample where the function...
to which only approximations can be made. As a simple example, consider the problem of finding the model

, which minimizes

, for data pairs

drawn from some distribution

. In practical situations we would only have

samples from

and thus, for the above example, we would only minimize

. Thus, the cost is minimized over a sample of the data rather than the entire data set.
When

some form of
online machine learningIn machine learning, online learning is a model of induction that learns one instance at atime. The goal in online learning is to predict labels for instances.For example, the instances could describe the current conditions of...
must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when

is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
Choosing a cost function
While it is possible to define some arbitrary,
ad hocAd hoc is a Latin phrase meaning "for this". It generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes. Compare A priori....
cost function, frequently a particular cost will be used, either because it has desirable properties (such as
convexityIn mathematics, a real-valued function f defined on an interval is called convex if the graph of the function lies below the line segment joining any two points of the graph. Equivalently, a function is convex if its epigraph is a convex set...
) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided below.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are
supervised learningSupervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value...
,
unsupervised learningIn machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution...
and
reinforcement learningInspired by behaviorist psychology, reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward...
.
Supervised learning
In
supervised learningSupervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value...
, we are given a set of example pairs

and the aim is to find a function

in the allowed class of functions that matches the examples. In other words, we wish to
infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.
A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimize this cost using
gradient descentGradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point...
for the class of neural networks called
multilayer perceptronA multilayer perceptron is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate output. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a...
s, one obtains the common and well-known
backpropagation algorithmBackpropagation is a common method of teaching artificial neural networks how to perform a given task. Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969 . It wasn't until 1974 and later, when applied in the context of neural networks and...
for training neural networks.
Tasks that fall within the paradigm of supervised learning are
pattern recognitionIn machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...
(also known as classification) and
regressionIn statistics, regression analysis includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables...
(also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
Unsupervised learning
In
unsupervised learningIn machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution...
, some data

is given and the cost function to be minimized, that can be any function of the data

and the network's output,

.
The cost function is dependent on the task (what we are trying to model) and our
a priori assumptions (the implicit properties of our model, its parameters and the observed variables).
As a trivial example, consider the model

, where

is a constant and the cost

. Minimizing this cost will give us a value of

that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the
mutual informationIn probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables...
between

and

, whereas in statistical modeling, it could be related to the
posterior probabilityIn Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence is taken into account...
of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized).
Tasks that fall within the paradigm of unsupervised learning are in general
estimationEstimation is the calculated approximation of a result which is usable even if input data may be incomplete or uncertain.In statistics,*estimation theory and estimator, for topics involving inferences about probability distributions...
problems; the applications include
clusteringCluster analysis or clustering is the task of assigning a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters....
, the estimation of statistical distributions,
compressionIn computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
and
filteringBayesian spam filtering is a statistical technique of e-mail filtering. It makes use of a naive Bayes classifier to identify spam e-mail.Bayesian classifiers work by correlating the use of tokens , with spam and non spam e-mails and then using Bayesian inference to calculate a probability that an...
.
Reinforcement learning
In
reinforcement learningInspired by behaviorist psychology, reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward...
, data

are usually not given, but generated by an agent's interactions with the environment. At each point in time

, the agent performs an action

and the environment generates an observation

and an instantaneous cost

, according to some (usually unknown) dynamics. The aim is to discover a
policy for selecting actions that minimizes some measure of a long-term cost; i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a
Markov decision processMarkov decision processes , named after Andrey Markov, provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via...
(MDP) with states

and actions

with the following probability distributions: the instantaneous cost distribution

, the observation distribution

and the transition

, while a policy is defined as conditional distribution over actions given the observations. Taken together, the two define a
Markov chainA Markov chain, named after Andrey Markov, is a mathematical system that undergoes transitions from one state to another, between a finite or countable number of possible states. It is a random process characterized as memoryless: the next state depends only on the current state and not on the...
(MC). The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems,
gameA game is structured playing, usually undertaken for enjoyment and sometimes used as an educational tool. Games are distinct from work, which is usually carried out for remuneration, and from art, which is more often an expression of aesthetic or ideological elements...
s and other sequential decision making tasks.
Learning algorithms
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a
BayesianBayesian probability is one of the different interpretations of the concept of probability and belongs to the category of evidential probabilities. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with propositions, whose truth or falsity is...
framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.
Most of the algorithms used in training artificial neural networks employ some form of
gradient descentGradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point...
. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a
gradient-related direction.
Evolutionary methods,
simulated annealingSimulated annealing is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete...
, expectation-maximization, non-parametric methods and
particle swarm optimizationIn computer science, particle swarm optimization is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality...
are some commonly used methods for training neural networks.
Employing artificial neural networks
Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism that 'learns' from observed data. However, using them is not so straightforward and a relatively good understanding of the underlying theory is essential.
- Choice of model: This will depend on the data representation and the application. Overly complex models tend to lead to problems with learning.
- Learning algorithm: There are numerous trade-offs between learning algorithms. Almost any algorithm will work well with the correct hyperparameter
In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis...
s for training on a particular fixed data set. However selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
- Robustness: If the model, cost function and learning algorithm are selected appropriately the resulting ANN can be extremely robust.
With the correct implementation, ANNs can be used naturally in
online learningIn computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an offline algorithm is given the whole problem data from...
and large data set applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allows for fast, parallel implementations in hardware.
Applications
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.
Real-life applications
The tasks artificial neural networks are applied to tend to fall within the following broad categories:
- Function approximation
The need for function approximations arises in many branches of applied mathematics, and computer science in particular. In general, a function approximation problem asks us to select a function among a well-defined class that closely matches a target function in a task-specific way.One can...
, or regression analysisIn statistics, regression analysis includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables...
, including time series prediction, fitness approximationIn function optimization, fitness approximation is a method for decreasing the number of fitness function evaluations to reach a target solution...
and modeling.
- Classification, including pattern
In machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...
and sequence recognition, novelty detectionNovelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. Novelty detection is one-class classification. The known data form one class, and a novelty-detection method tries to identify outliers that differ from the...
and sequential decision making.
- Data processing
Computer data processing is any process that a computer program does to enter data and summarise, analyse or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating and...
, including filtering, clustering, blind source separation and compression.
- Robotics
Robotics is the branch of technology that deals with the design, construction, operation, structural disposition, manufacture and application of robots...
, including directing manipulators, Computer numerical control.
Application areas include system identification and control (vehicle control, process control), quantum chemistry, game-playing and decision making (backgammon, chess,
pokerPoker is a family of card games that share betting rules and usually hand rankings. Poker games differ in how the cards are dealt, how hands may be formed, whether the high or low hand wins the pot in a showdown , limits on bet sizes, and how many rounds of betting are allowed.In most modern poker...
), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications (automated trading systems),
data miningData mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
(or knowledge discovery in databases, "KDD"), visualization and
e-mail spamEmail spam, also known as junk email or unsolicited bulk email , is a subset of spam that involves nearly identical messages sent to numerous recipients by email. Definitions of spam usually include the aspects that email is unsolicited and sent in bulk. One subset of UBE is UCE...
filtering.
Neural networks and neuroscience
Theoretical and
computational neuroscienceComputational neuroscience is the study of brain function in terms of the information processing properties of the structures that make up the nervous system...
is the field concerned with the theoretical analysis and computational modeling of biological neural systems. Since neural systems are intimately related to cognitive processes and behavior, the field is closely related to cognitive and behavioral modeling.
The aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (
biological neural networkIn neuroscience, a biological neural network describes a population of physically interconnected neurons or a group of disparate neurons whose inputs or signalling targets define a recognizable circuit. Communication between neurons often involves an electrochemical process...
models) and theory (statistical learning theory and
information theoryInformation theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...
).
Types of models
Many models are used in the field defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behavior of individual neurons, models of how the dynamics of neural circuitry arise from interactions between individual neurons and finally to models of how behavior can arise from abstract neural modules that represent complete subsystems. These include models of the long-term, and short-term plasticity, of neural systems and their relations to learning and memory from the individual neuron to the system level.
Current research
While initial research had been concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as
dopamineDopamine is a catecholamine neurotransmitter present in a wide variety of animals, including both vertebrates and invertebrates. In the brain, this substituted phenethylamine functions as a neurotransmitter, activating the five known types of dopamine receptors—D1, D2, D3, D4, and D5—and their...
,
acetylcholineThe chemical compound acetylcholine is a neurotransmitter in both the peripheral nervous system and central nervous system in many organisms including humans...
, and
serotoninSerotonin or 5-hydroxytryptamine is a monoamine neurotransmitter. Biochemically derived from tryptophan, serotonin is primarily found in the gastrointestinal tract, platelets, and in the central nervous system of animals including humans...
on behavior and learning.
BiophysicalBiophysics is an interdisciplinary science that uses the methods of physical science to study biological systems. Studies included under the branches of biophysics span all levels of biological organization, from the molecular scale to whole organisms and ecosystems...
models, such as
BCM theoryBCM theory, BCM synaptic modification, or the BCM rule, named for Elie Bienenstock, Leon Cooper, and Paul Munro, is a physical theory of learning in the visual cortex developed in 1981...
, have been important in understanding mechanisms for
synaptic plasticityIn neuroscience, synaptic plasticity is the ability of the connection, or synapse, between two neurons to change in strength in response to either use or disuse of transmission over synaptic pathways. Plastic change also results from the alteration of the number of receptors located on a synapse...
, and have had applications in both computer science and neuroscience. Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for radial basis networks and
neural backpropagationNeural backpropagation is the phenomenon in which the action potential of a neuron creates a voltage spike both at the end of the axon and back through to the dendritic arbor or dendrites, from which much of the original input current originated...
as mechanisms for processing data.
Computational devices have been created in CMOS for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices for very large scale principal components analyses and
convolutionIn mathematics and, in particular, functional analysis, convolution is a mathematical operation on two functions f and g, producing a third function that is typically viewed as a modified version of one of the original functions. Convolution is similar to cross-correlation...
. If successful, these effort could usher in a new era of neural computing that is a step beyond digital computing, because it depends on
learningLearning is acquiring new or modifying existing knowledge, behaviors, skills, values, or preferences and may involve synthesizing different types of information. The ability to learn is possessed by humans, animals and some machines. Progress over time tends to follow learning curves.Human learning...
rather than programming and because it is fundamentally
analogAn analog or analogue signal is any continuous signal for which the time varying feature of the signal is a representation of some other time varying quantity, i.e., analogous to another time varying signal. It differs from a digital signal in terms of small fluctuations in the signal which are...
rather than
digitalA digital system is a data technology that uses discrete values. By contrast, non-digital systems use a continuous range of values to represent information...
even though the first instantiations may in fact be with CMOS digital devices.
Neural network software
Neural network software is used to
simulateSimulation is the imitation of some real thing available, state of affairs, or process. The act of simulating something generally entails representing certain key characteristics or behaviours of a selected physical or abstract system....
,
researchResearch can be defined as the scientific search for knowledge, or as any systematic investigation, to establish novel facts, solve new or existing problems, prove new ideas, or develop new theories, usually using a scientific method...
, develop and apply artificial neural networks,
biological neural networkIn neuroscience, a biological neural network describes a population of physically interconnected neurons or a group of disparate neurons whose inputs or signalling targets define a recognizable circuit. Communication between neurons often involves an electrochemical process...
s and in some cases a wider array of
adaptive systemThe term adaptation arises mainly in the biological scope as a trial to study the relationship between the characteristics of living beings and their environments...
s.
Types of artificial neural networks
Artificial neural network types vary from those with only one or two layers of single direction logic, to complicated multi–input many directional feedback loop and layers. On the whole, these systems use algorithms in their programming to determine control and organization of their functions. Some may be as simple, one neuron layer with an input and an output, and others can mimic complex systems such as dANN, which can mimic chromosomal DNA through sizes at cellular level, into artificial organisms and simulate reproduction, mutation and population sizes.
Most systems use "weights" to change the parameters of the throughput and the varying connections to the neurons. Artificial neural networks can be autonomous and learn by input from outside "teachers" or even self-teaching from written in rules.
Computational power
The multi-layer
perceptronThe perceptron is a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier.- Definition :...
(MLP) is a universal function approximator, as proven by the Cybenko theorem. However, the proof is not constructive regarding the number of neurons required or the settings of the weights.
Work by
Hava SiegelmannHava Siegelmann is a computer scientist at the University of Massachusetts and director of the school's Biologically Inspired Neural and Dynamical Systems Lab. In the early 1990s she proposed a new computational model, the Artificial Recurrent Neural Network , and proved that it could perform...
and
Eduardo D. SontagEduardo Daniel Sontag is an American/Argentine mathematician, and Professor at the Rutgers University, who works in the field of systems biology and control theory and engineering.- Biography :...
has provided a proof that a specific recurrent architecture with rational valued weights (as opposed to full precision
real numberIn mathematics, a real number is a value that represents a quantity along a continuum, such as -5 , 4/3 , 8.6 , √2 and π...
-valued weights) has the full power of a
Universal Turing MachineIn computer science, a universal Turing machine is a Turing machine that can simulate an arbitrary Turing machine on arbitrary input. The universal machine essentially achieves this by reading both the description of the machine to be simulated as well as the input thereof from its own tape. Alan...
using a finite number of neurons and standard linear connections. They have further shown that the use of irrational values for weights results in a machine with
super-TuringHypercomputation or super-Turing computation refers to models of computation that are more powerful than, or are incomparable with, Turing computability. This includes various hypothetical methods for the computation of non-Turing-computable functions, following super-recursive algorithms...
power.
Capacity
Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.
Convergence
Nothing can be said in general about convergence since it depends on a number of factors. Firstly, there may exist many local minima. This depends on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum. Thirdly, for a very large amount of data or parameters, some methods become impractical. In general, it has been found that theoretical guarantees regarding convergence are an unreliable guide to practical application.
Generalization and statistics
In applications where the goal is to create a system that generalizes well in unseen examples, the problem of over-training has emerged. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem: The first is to use cross-validation and similar techniques to check for the presence of overtraining and optimally select hyperparameters such as to minimize the generalization error. The second is to use some form of
regularization. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over two quantities: the 'empirical risk' and the 'structural risk', which roughly corresponds to the error over the training set and the predicted error in unseen data due to overfitting.
Supervised neural networks that use an
MSEIn statistics, the mean squared error of an estimator is one of many ways to quantify the difference between values implied by a kernel density estimator and the true values of the quantity being estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or...
cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the
confidence intervalIn statistics, a confidence interval is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval , in principle different from sample to sample, that frequently includes the parameter of interest, if the...
of the output of the network, assuming a
normal distribution. A confidence analysis made this way is statistically valid as long as the output
probability distributionIn probability theory, a probability mass, probability density, or probability distribution is a function that describes the probability of a random variable taking certain values....
stays the same and the network is not modified.
By assigning a
softmax activation functionThe softmax activation function is a neural transfer function. In neural networks, transfer functions calculate a layer's output from its net input. It is a biologically plausible approximation to the maximum operation...
on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications.
The softmax activation function is:
Dynamic properties
Various techniques originally developed for studying disordered magnetic systems (i.e., the
spin glassA spin glass is a magnet with frustrated interactions, augmented by stochastic disorder, where usually ferromagnetic and antiferromagnetic bonds are randomly distributed...
) have been successfully applied to simple neural network architectures, such as the Hopfield network. Influential work by E. Gardner and B. Derrida has revealed many interesting properties about
perceptronThe perceptron is a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier.- Definition :...
s with real-valued synaptic weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.
Criticism
A common criticism of artificial neural networks, particularly in robotics, is that they require a large diversity of training for real-world operation. Dean Pomerleau, in his research presented in the paper "Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving," uses a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.). A large amount of his research is devoted to (1) extrapolating multiple training scenarios from a single training experience, and (2) preserving past training diversity so that the system does not become overtrained (if, for example, it is presented with a series of right turns – it should not learn to always turn right). These issues are common in neural networks that must decide from amongst a wide variety of responses.
A. K. Dewdney, a former
Scientific AmericanScientific American is a popular science magazine. It is notable for its long history of presenting science monthly to an educated but not necessarily scientific public, through its careful attention to the clarity of its text as well as the quality of its specially commissioned color graphics...
columnist, wrote in 1997, "Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool." (Dewdney, p. 82)
Arguments for Dewdney's position are that to implement large and effective software neural networks, much processing and storage resources need to be committed. While the brain has hardware tailored to the task of processing signals through a
graphIn mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...
of neurons, simulating even a most simplified form on Von Neumann technology may compel a NN designer to fill many millions of
databaseA database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
rows for its connections - which can lead to excessive
RAM-Animals:*Ram, an uncastrated male sheep*Ram cichlid, a species of freshwater fish endemic to Colombia and Venezuela-Military:*Battering ram*Ramming, a military tactic in which one vehicle runs into another...
and HD necessities. Furthermore, the designer of NN systems will often need to simulate the transmission of signals through many of these connections and their associated neurons - which must often be matched with incredible amounts of CPU processing power and time. While neural networks often yield effective programs, they too often do so at the cost of time and money efficiency.
Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft to detecting credit card fraud. Technology writer Roger Bridgman commented on Dewdney's statements about neural nets:
Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource". In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. An unreadable table that a useful machine could read would still be well worth having.
Some other criticisms came from believers of hybrid models (combining neural networks and symbolic approaches). They advocate the intermix of these two approaches and believe that hybrid models can better capture the mechanisms of the human mind (Sun and Bookman 1994).
Further reading
External links