L-BFGS - AbsoluteAstronomy.com

The limited-memory BFGS algorithm

Algorithm

In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

is a member of the broad family of quasi-Newton

Quasi-Newton method

In optimization, quasi-Newton methods are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on...

optimization

Optimization (mathematics)

In mathematics, computational science, or management science, mathematical optimization refers to the selection of a best element from some set of available alternatives....

methods that uses a limited memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS)

BFGS method

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno method is a method for solving nonlinear optimization problems ....

update to approximate the inverse Hessian matrix

Hessian matrix

In mathematics, the Hessian matrix is the square matrix of second-order partial derivatives of a function; that is, it describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named...

(denoted by

). Unlike the original BFGS

BFGS method

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno method is a method for solving nonlinear optimization problems ....

method which stores a dense

approximation, L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its moderate memory requirement, L-BFGS method is particularly well suited for optimization problems with a large number of variables. L-BFGS never explicitly forms or stores

. Instead, it maintains a history of the past

updates of the position

and gradient

, where generally the history

can be short, often less than 10. These updates are used to implicitly do operations requiring the

-vector product. While strictly, a straightforward BFGS implementation at the

-th iteration would represent the inverse Hessian approximation as informed by all updates on

, L-BFGS does quite well using updates from only the most recent iterations

Representation

An L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication for finding the search direction is carried out

. There are multiple published approaches to using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."

We'll take as given

, the position at the

-th iteration, and

where

is the function being minimized, and all vectors are column vectors. Then we keep the updates

and

. We'll define

will be the 'initial' approximate of the inverse Hessian that our estimate at iteration

begins with.
Then we can compute the (uphill) direction as follows:

For

Stop with

This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of z (since z is "uphill"), and if we are maximizing,

should be negative definite rather than positive definite. We would typically do a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).

Commonly, the inverse Hessian

is represented as a diagonal matrix, so that initially setting

requires only an element-by-element multiplication.

This two loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian

have also been developed, as have other means of approximating the inverse Hessian.

Applications

L-BFGS is called "the algorithm of choice" for fitting log-linear (MaxEnt) models

Multinomial logit

In statistics, economics, and genetics, a multinomial logit model, also known as multinomial logistic regression, is a regression model which generalizes logistic regression by allowing more than two discrete outcomes...

and conditional random field

Conditional random field

A conditional random field is a statistical modelling method often applied in pattern recognition.More specifically it is a type of discriminative undirected probabilistic graphical model. It is used to encode known relationships between observations and construct consistent interpretations...

s with

-regularization

Regularization (mathematics)

In mathematics and statistics, particularly in the fields of machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting...

by Andrew and Gao (who developed a variant called OWL-QN

Orthant-wise limited-memory quasi-Newton

Orthant-wise limited-memory quasi-Newton is a numerical optimization algorithm that belongs to the class of quasi-Newton methods, and is specifically designed to serve in the training/fitting algorithm of log-linear models with \ell_1-regularization...

for

-regularized models).

Implementations

An early, open source implementation of L-BFGS in Fortran exists in Netlib

Netlib

Netlib is a repository of software for scientific computing maintained by AT&T, Bell Laboratories, the University of Tennessee and Oak Ridge National Laboratory. Netlib comprises a large number of separate programs and libraries...

as a shar

Shar

In the Unix operating system, shar is an archive format. A shar file is a shell script, and executing it will recreate the files. This is a type of self-extracting archive file. It can be created with the Unix shar utility...

archive http://netlib.org/opt/lbfgs_um.shar. Multiple other open source implementations have been produced by as translations of this Fortran code (e.g. java, and python via SciPy

SciPy

SciPy is an open source library of algorithms and mathematical tools for the Python programming language.SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and...

). Other implementations exist (e.g. Matlab (optimization toolbox) Matlab (BSD)), frequently as part of generic optimization libraries (e.g. Mathematica). A variant which can handle box-constraints on the variables, L-BFGS-B, also exists as ACM TOMS algorithm 778, and can be called from R's

R (programming language)

R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software, and R is widely used for statistical software development and data analysis....

optim general-purpose optimizer routing by using method="L-BFGS-B". The libLBFGS is a C

C (programming language)

C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

implementation.

Representation

Applications

Implementations

Further reading