Gradient descent is a first-order
optimizationIn mathematics, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives.In the simplest case, this means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or...
algorithmIn mathematics, computing, linguistics, and related subjects, an algorithm is an effective method for solving a problem using a finite sequence of instructions. Algorithms are used for calculation, data processing, and many other fields....
. To find a local minimum of a function using gradient descent, one takes steps proportional to the
negative of the
gradientIn vector calculus, the gradient of a scalar field is a vector field which points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is the greatest rate of change....
(or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as
gradient ascent.
Gradient descent is also known as
steepest descent, or the
method of steepest descent.
Gradient descent is a first-order
optimizationIn mathematics, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives.In the simplest case, this means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or...
algorithmIn mathematics, computing, linguistics, and related subjects, an algorithm is an effective method for solving a problem using a finite sequence of instructions. Algorithms are used for calculation, data processing, and many other fields....
. To find a local minimum of a function using gradient descent, one takes steps proportional to the
negative of the
gradientIn vector calculus, the gradient of a scalar field is a vector field which points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is the greatest rate of change....
(or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as
gradient ascent.
Gradient descent is also known as
steepest descent, or the
method of steepest descent. When known as the latter, gradient descent should not be confused with the
method of steepest descentIn mathematics, the steepest descent method or saddle-point approximation, is a method used to approximate integrals of the formwhere f is some twice-differentiable function, M is a large number, and the integral endpoints a and b could possibly be infinite...
for approximating integrals.
Description
Gradient descent is based on the observation that if the real-valued function is defined and differentiable in a neighborhood of a point , then decreases
fastest if one goes from in the direction of the negative gradient of at , . It follows that, if
for a small enough number, then . With this observation in mind, one starts with a guess for a local minimum of , and considers the sequence
such that
We have
so hopefully the sequence converges to the desired local minimum. Note that the value of the
step size is allowed to change at every iteration.
This process is illustrated in the picture to the right. Here is assumed to be defined on the plane, and that its graph has a
bowlA bowl is a common open-top container used in many cultures to serve food, and is also used for drinking and storing other items. They are typically small and shallow, although some, such as punch bowls and salad bowls, are larger and often intended to serve many people.Bowls have existed for...
shape. The blue curves are the
contour lineA contour line of a function of two variables is a curve along which the function has a constant value. In cartography, a contour line joins points of equal elevation above a given level, such as mean sea level...
s, that is, the regions on which the value of is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is orthogonal to the contour line going through that point. We see that gradient
descent leads us to the bottom of the bowl, that is, to the point where the value of the function is minimal.
Examples
Gradient descent has problems with pathological functions such as the
Rosenbrock functionIn mathematical optimization, the Rosenbrock function is a non-convex function used as a test problem for optimization algorithms. It is also known as Rosenbrock's valley or Rosenbrock's banana function....
shown here. The Rosenbrock function has a narrow curved valley which contains the minimum. The bottom of the valley is very flat. Because of the curved flat valley the optimization is zig-zagging slowly with small stepsizes towards the minimum.

The gradient ascent method applied to :
>
----
Gradient descent can also be used to solve a system of nonlinear equations. Below is an example that shows how to use the gradient descent to solve for three unknown variables, x
1, x
2, and x
3. This example shows one iteration of the gradient descent.
Consider a nonlinear system of equations:
With initial guess
There must be an initial assumption as to what the solutions for x
1, x
2, and x
3 are. In this case assume that all the solutions are zero.
The next step is to form the Jacobian matrix of the given system of equations.
Let,
Plug in the initial assumption of zero into the Jacobian matrix and original system to get the answers below.
Now the
gradientIn vector calculus, the gradient of a scalar field is a vector field which points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is the greatest rate of change....
and z
0 must be found. This is done by using the formula below and the previous two calculations made. The value for z
0 is found by simply calculating the p-norm of the gradient, where p is equal to 2.
For , we have and
Now the value for z must be found.
The next step is to find solutions for g
1, g
2, and g
3. First there must be a value chosen for α
1. Set α
1 to zero and α
3 to one.
There is a condition that must be met when the solutions for g
1 and g
3 are found. That condition is that g
3 must be less than g
1 and if this is true we accept α
1 and α
3. If this is not true then a different value must be chosen for α
3, one which will satisfy this condition. Try choosing a smaller value for α
3 like 1/2 or 1/4.
To find α
2 all that needs to be done is to divide α
3 by 2 → α
2 = α
3 / 2 .
With α
2 found, g
2 can be calculated.
The next step is to form a
Newton polynomialIn the mathematical field of numerical analysis, a Newton polynomial, named after its inventor Isaac Newton, is the interpolation polynomial for a given set of data points in the Newton form...
, Newton's forward divided difference interpolating polynomial.
Using all the calculations of α
1, α
2, α
3, g
1, g
2, and g
3 the unknown values of h
1, h
2, and h
3 can be found, which is shown below.
With all the values for h
1, h
2, and h
3 known, the next step is to plug them into the polynomial and solve for α. A value for g
0 must be found and must satisfy the condition of being less than g
1 and g
3. If this condition is met then use the following equation below to solve for x
(1) which involves the initial x
(0), α which becomes α
0, and z.
And now there is a set of solutions that satisfy the system of nonlinear equations.
Comments
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case the search space is typically a
function spaceIn mathematics, a function space is a set of functions of a given kind from a set X to a set Y. It is called a space because in many applications, it is a topological space or a vector space or both.-Examples:...
, and one calculates the
Gâteaux derivativeIn mathematics, the Gâteaux differential or Gâteaux derivative is a generalisation of the concept of directional derivative in differential calculus. Named after René Gâteaux, a French mathematician who died young in World War I, it is defined for functions between locally convex topological vector...
of the functional to be minimized to determine the descent direction.
Two weaknesses of gradient descent are:
- The algorithm can take many iterations to converge towards a local minimum, if the curvature in different directions is very different.
- Finding the optimal per step can be time-consuming. Conversely, using a fixed can yield poor results. Methods based on Newton's method
In mathematics, Newton's method is a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions by noticing that if a real number is a stationary point of a function , then is a root of the derivative ,...
and inversion of the HessianIn mathematics, the Hessian matrix is the square matrix of second-order partial derivatives of a function; that is, it describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named...
using conjugate gradient techniques are often a better alternative.
A more powerful algorithm is given by the
BFGS methodIn mathematics, the Broyden–Fletcher–Goldfarb–Shanno method is a method to solve an unconstrained nonlinear optimization problem....
which consists in calculating on every step a matrix by which the gradient vector is multiplied to go into a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of
Gradient descent is in fact Euler's method for solving ordinary differential equations applied to a gradient flow. As the goal is to find the minimum, not the flow line, the error in finite methods is less significant.
A computational example
The gradient descent algorithm is applied to find a local minimum of the function
f(
x)=
x4-3
x3+2 , with derivative
f'(
x)=4
x3-9
x2. Here is an implementation in the Python programming language.
- From calculation, we expect that the local minimum occurs at x=9/4
xOld = 0
xNew = 6 # The algorithm starts at x=6
eps = 0.01 # step size
precision = 0.00001
def f_prime(x):
return 4 * x**3 - 9 * x**2
while abs(xNew - xOld) > precision:
xOld = xNew
xNew = xNew - eps * f_prime(xNew)
print "Local minimum occurs at", xNew
With this precision, the algorithm converges to a local minimum of 2.24996 in 70 iterations.
A more robust implementation of the algorithm would also check whether the function value indeed decreases at every iteration and would make the step size smaller otherwise. One can also use an adaptive step size which may make the algorithm converge faster.
See also
- Conjugate gradient
- Stochastic gradient descent
thumb|right|Fluctuations in the total objective function as gradient steps with respect to mini-batches are takenStochastic gradient descent is a general optimization algorithm, but is typically used to fit the parameters of a machine learning model....
- Newton's method
In mathematics, Newton's method is a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions by noticing that if a real number is a stationary point of a function , then is a root of the derivative ,...
- Optimization
In mathematics, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives.In the simplest case, this means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or...
- Line search
- Delta rule
The delta rule is a gradient descent learning rule for updating the weights of the artificial neurons in a single-layer perceptron. For a neuron with activation function the delta rule for 's th weight is given by,...
- Wolfe conditions
In unconstrained minimization, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods. Inexact line searches provide an efficient way of computing an acceptable step length that reduces the cost 'sufficiently', rather than minimizing...