All Topics  
Floating point

 

   Email Print
   Bookmark   Link






 

Floating point



 
 
In computing
Computing

Computing is usually defined as the activity of using and developing computer technology, computer hardware and computer software. It is the computer-specific part of information technology....
, floating point describes a system for numerical representation in which a string
String (computer science)

In computer programming and some branches of mathematics, a string is an ordered sequence of symbols. These symbols are chosen from a predetermined set or alphabet....
 of digits (or bit
Bit

A bit is a binary numeral system numerical digit, taking a value of either 0 or 1. Binary digits are a basic unit of information Computer data storage and transmission in digital computing and digital information theory....
s) represents a rational number
Rational number

In mathematics, a rational number is a number which can be expressed as a quotient of two integers. Non-integer rational numbers are usually written as the vulgar fraction , where b is not 0 ....
.

The term floating point refers to the fact that the radix point
Radix point

In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fraction part ....
 (decimal point, or, more commonly in computers, binary point) can "float": that is, it can be placed anywhere relative to the significant digits
Significant figures

The significant figures of a number are those Numerical digit that carry meaning contributing to its accuracy . This includes all digits except:...
 of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation
Scientific notation

Scientific notation, also known as standard form or as exponential notation, is a way of writing numbers that accommodates values too large or small to be conveniently written in standard decimal notation....
.






Discussion
Ask a question about 'Floating point'
Start a new discussion about 'Floating point'
Answer questions from other users
Full Discussion Forum



Encyclopedia


In computing
Computing

Computing is usually defined as the activity of using and developing computer technology, computer hardware and computer software. It is the computer-specific part of information technology....
, floating point describes a system for numerical representation in which a string
String (computer science)

In computer programming and some branches of mathematics, a string is an ordered sequence of symbols. These symbols are chosen from a predetermined set or alphabet....
 of digits (or bit
Bit

A bit is a binary numeral system numerical digit, taking a value of either 0 or 1. Binary digits are a basic unit of information Computer data storage and transmission in digital computing and digital information theory....
s) represents a rational number
Rational number

In mathematics, a rational number is a number which can be expressed as a quotient of two integers. Non-integer rational numbers are usually written as the vulgar fraction , where b is not 0 ....
.

The term floating point refers to the fact that the radix point
Radix point

In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fraction part ....
 (decimal point, or, more commonly in computers, binary point) can "float": that is, it can be placed anywhere relative to the significant digits
Significant figures

The significant figures of a number are those Numerical digit that carry meaning contributing to its accuracy . This includes all digits except:...
 of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation
Scientific notation

Scientific notation, also known as standard form or as exponential notation, is a way of writing numbers that accommodates values too large or small to be conveniently written in standard decimal notation....
. Over the years several different floating-point representations have been used in computers; however, for the last ten years the most commonly encountered representation is that defined by the IEEE 754 Standard.

The advantage of floating-point representation over fixed-point
Fixed-point arithmetic

In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point . Fixed-point number representation can be compared to the more complicated floating point number representation....
 (and integer
Integer (computer science)

In computer science, the term integer is used to refer to a data type which represents some finite subset of the mathematical integers. These are also known as integral data types....
) representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 8765.43, 123.00, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of slightly less precision
Accuracy and precision

In the fields of science, engineering, industry and statistics, accuracy is the degree of closeness of a Measure d or calculated quantity to its actual Value ....
.

The speed of floating-point operations is an important measure of performance for computers in many application domains. It is measured in FLOPS
FLOPS

In computing, FLOPS is an acronym meaning FLoating point Operations Per Second. The FLOPS is a measure of a computer's computer performance, especially in fields of scientific calculations that make heavy use of floating point calculations, similar to instructions per second....
.

World-class supercomputer installations
TOP500

The TOP500 project ranks and details the 500 most powerful known computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year....
 are generally rated in teraflops. In June 2008, the IBM Roadrunner supercomputer achieved 1.026 petaflops, or 1.026 quadrillion floating-point operations per second. In November 2008, Oak Ridge National Laboratory
Oak Ridge National Laboratory

Oak Ridge National Laboratory is a multiprogram science and technology national laboratory managed for the United States Department of Energy by UT-Battelle....
's supercomputer was upgraded to hit a theoretical peak computing power of 1.64 petaflops making Jaguar the world's first petaflop system dedicated to open research.

Overview


A number representation (called a numeral system
Numeral system

A numeral system is a writing system for expressing numerals , and a mathematical notation for representing numbers of a given set, using graphemes or symbols in a consistent manner....
 in mathematics) specifies some way of storing a number that may be encoded as a string of digits. The arithmetic is defined as a set of actions on the representation that simulate classical arithmetic operations.

There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point
Radix point

In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fraction part ....
 is indicated by placing an explicit "point" character
Decimal separator

In a Positional notation numeral system, the decimal separator is a symbol used to mark the boundary between the integer and the fraction parts of a decimal numeral....
 (dot or comma) there. If the radix point is omitted then it is implicitly assumed to lie at the right (least significant) end of the string (that is, the number is an integer
Integer

The integers are natural numbers including 0 and their negative and non-negative numberss . They are numbers that can be written without a fractional or decimal component, and fall within the set ....
). In fixed-point
Fixed-point arithmetic

In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point . Fixed-point number representation can be compared to the more complicated floating point number representation....
 systems, some specific assumption is made about where the radix point is located in the string. For example, the convention could be that the string consists of 8 decimal digits with the decimal point in the middle, so that "00012345" has a value of 1.2345.

In scientific notation
Scientific notation

Scientific notation, also known as standard form or as exponential notation, is a way of writing numbers that accommodates values too large or small to be conveniently written in standard decimal notation....
, the given number is scaled by a power of 10 so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io
Io (moon)

'Io' is the innermost of the four Galilean moons natural satellite of Jupiter and, with a diameter of 3,642 Kilometre, the List of moons by diameter in the Solar System....
 is 152853.5047 seconds. This is represented in standard-form scientific notation as 1.528535047 seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

  • A signed digit string of a given length in a given base
    Base (mathematics)

    In arithmetic, the base refers to the number b in an expression of the form bn. The number n is called the exponent and the expression is known formally as exponentiation of b by n or the exponential of n with base b....
     (or radix
    Radix

    In numeral system, the base or radix is usually the number of unique Numerical digit, including zero, that a Positional notation numeral system uses to represent numbers....
    ). This is known as the significand
    Significand

    The significand is the part of a floating point that contains its significant digits. Depending on the interpretation of the exponent, the significand may be considered to be an integer or a fraction ....
    , or sometimes the mantissa
    Mantissa

    Mantissa may refer to:* Mantissa * Significand, part of a floating-point number* Part of a common logarithm* A novel by John Fowles...
     (see below) or coefficient. The radix point is not explicitly included, but is implicitly assumed to always lie in a certain position within the significand—often just after or just before the most significant digit, or to the right of the rightmost digit. This article will generally follow the convention that the radix point is just after the most significant (leftmost) digit. The length of the significand determines the precision to which numbers can be represented.
  • A signed integer exponent, also referred to as the characteristic or scale, which modifies the magnitude of the number.


The significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar decimal
Decimal representation

A decimal representation of a non-negative real number r is an expression of the formwhere a0 is a nonnegative integer, and a1,...
 notation) as an example, the number 152853.5047, which has ten decimal digits of precision, is represented as the significand 1528535047 together with an exponent of 5 (if the implied position of the radix point is after the first most significant digit, here 1). To recover the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 105 to give 1.528535047 × 105, or 152853.5047.

Symbolically, this final value is

where s is the value of the significand (after taking into account the implied radix point), b is the base, and e is the exponent.

Equivalently, this is:


where s here means the integer value of the entire significand, ignoring any implied decimal point, and p is the precision—the number of digits in the significand.

Historically, different bases have been used for representing floating-point numbers, with base 2 (binary
Binary numeral system

The binary numeral system, or notation with a radix of 2. Owing to its straightforward implementation in digital electronic circuitry using logic gates, the binary system is used internally by all modern computers....
) being the most common, followed by base 10 (decimal), and other less common varieties such as base 16.

The way in which the significand, exponent and sign bits are internally stored on a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation p=24 and so the significand is a string of 24 bit
Bit

A bit is a binary numeral system numerical digit, taking a value of either 0 or 1. Binary digits are a basic unit of information Computer data storage and transmission in digital computing and digital information theory....
s (1s and 0s). For instance, the number p
P

P is the sixteenth letter of the modern Latin alphabet. Its name in English language is pronounced pee ....
 rounded to 24 bits is 11.001001000011111101101. When this is stored using the IEEE 754 encoding, this is represented as s = 110010010000111111011011 with e = 1 (where s is assumed to have a binary point after the first bit), left-adjusted (or normalized) so the first first is the first non-zero bit (if there is one). Since this first bit of a non-zero binary significand is always 1 it need not be stored, giving an extra bit of precision. Normalization can therefore be thought of as a form of compression; it allows a binary significand to be compressed into a field one bit shorter than the maximum precision, at the expense of extra processing.

The word "mantissa" is often used as a synonym for significand. Many people do not consider this usage to be correct, because the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm
Common logarithm

The common logarithm is the logarithm with base 10. It is also known as the decadic logarithm, named after its base. It is indicated by log10, or sometimes Log with a capital L ....
 tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand.

Range of floating-point numbers


By allowing the radix point
Radix point

In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fraction part ....
 to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as
0.12 × 0.12 = 0.0144
would be expressed as × (1.2) = (1.44) In a fixed-point system with the decimal point at the left, it would be
0.120 × 0.120 = 0.014
A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.

The range of floating-point numbers depends on the number of bits or digits used for representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10−308 to 10308 (because 308 is approximately 1023 × log10(2), since the range of the exponent is [−1022,1023]). The complete range of the format is from about −10308 through +10308 (see IEEE 754).

The number of normalized floating point numbers in a system F(B, P, L, U) (where B is the base of the system, P is the precision of the system to P numbers, L is the smallest exponent representable in the system, and U is the largest exponent used in the system) is: 2 * (B - 1) * B^(P-1) * (U - L + 1) + 1. The one is added because the number could be zero.

There is a smallest positive normalized floating-point number, Underflow level = UFL = B^L which has a 1 as the leading digit and 0 for the remaining digits of the mantissa, and the smallest possible value for the exponent.

There is a largest floating point number, Overflow level = OFL = B^(U + 1) * (1 - B^(-P)) which has B - 1 as the value for each digit of the mantissa and the largest possible value for the exponent.

Any number larger than OFL cannot be represented in the given floating-point system, and no number smaller than the UFL can be represented in the floating point system.

History

In 1938, Konrad Zuse
Konrad Zuse

Konrad Zuse was a Germany Civil engineering and computer pioneer. His greatest achievement was the world's first functional program-controlled Turing-complete computer, the Z3 , in 1941 ....
 of Berlin, completed the "Z1
Z1 (computer)

The Z1 was a mechanical computer created by Konrad Zuse in 1936. It was a binary electrically driven mechanical calculator with limited programmability, reading instructions from punched tape....
", the first mechanical binary programmable computer. It was based on boolean algebra and had most of the basic ingredients of modern machines, using the binary system and today's standard separation of storage
Computer storage

Computer data storage, often called storage or memory, refers to computer components, devices, and recording medium that retain digital data used for computing for some interval of time....
 and control. Zuse's 1936 patent application (Z23139/GMD Nr. 005/021) also suggests a von Neumann architecture
Von Neumann architecture

The von Neumann architecture is a design model for a stored-program digital computer that uses a central processing unit and a single separate computer storage structure to hold both instructions and data ....
 (re-invented in 1945) with program and data modifiable in storage. Originally the machine was called the "V1" but it was retroactively renamed after the war, to avoid confusion with the V1 missile. It worked with floating-point numbers having a 7-bit exponent, 16-bit significand, and a sign bit. The memory used sliding metal parts to store 16 such numbers, and worked well; but the arithmetic unit was less successful, occasionally suffering from certain mechanical engineering problems. The program was read from punched discarded 35 mm movie film. Data values could be entered from a numeric keyboard, and outputs were displayed on electric lamps. The machine was not a general purpose computer because it lacked looping capabilities. The Z3 was completed in 1941 and was program-controlled.

Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the Z4
Z4 (computer)

The Z4 computer was the world's second commercial digital computer, designed by Germany engineer Konrad Zuse, built by his company Zuse Apparatebau....
 in 1950, followed by the IBM 704
IBM 704

The IBM 704, the first mass-produced computer with floating point arithmetic hardware, was introduced by IBM in April, 1954. The 704 was significantly improved over the IBM 701 in terms of architecture as well as implementation, and was not compatible with its predecessor....
 in 1954. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The PDP-11/44
PDP-11/44

The PDP-11/44, introduced in 1980, was the last of the Digital Equipment Corporation PDP-11 series of minicomputers implemented in discrete logic....
 was an extension of the 11/34 that included the cache memory and floating-point units as a standard feature.

The UNIVAC 1100/2200 series
UNIVAC 1100/2200 series

The UNIVAC 1100/2200 series is a series of compatible 36-bit computer systems, beginning with the UNIVAC 1107 in 1962, initially made by UNIVAC....
, introduced in 1962, supported two floating-point formats. Single precision used 36 bits, organized into a 1-bit sign, 8-bit exponent, and a 27-bit significand. Double precision used 72 bits organized as a 1-bit sign, 11-bit exponent, and a 60-bit significand. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.

Prior to the IEEE-754 standard, computers used many different forms of floating-point. These differed in the word-sizes, the format of the representations, and the rounding behavior of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.

The IEEE-754 standard was created in the early 1980s, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among the innovations are these:
  • A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
  • A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
  • The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.


Implementation in actual computers: IEEE floating-point


The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own format
IBM Floating Point Architecture

IBM System/360 computers, and subsequent machines based on that architecture , support a hexadecimal floating point format. The format is used by SAS Transport files as required by the Food and Drug Administration for New Drug Application study submissions....
 (in addition to the IEEE 754 binary and decimal formats), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format

The standard provides for many closely-related formats, differing in only a few details. Five of these formats are called basic formats, and two of these are especially widely used in computer hardware and languages:
  • Single precision
    Single precision

    In computing, single precision is a computer numbering format that occupies one storage location in computer memory at a given address. A single-precision number, sometimes simply a single, may be defined to be an integer, fixed point, or floating point....
    , called "float" in the C
    C (programming language)

    C is a general-purpose computer programming language originally developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating system....
     language family, and "real" or "real*4" in Fortran
    Fortran

    Fortran is a general-purpose programming language, procedural programming language, imperative programming language programming language that is especially suited to numerical analysis and scientific computing....
    . This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
  • Double precision
    Double precision

    In computing, double precision is a computer numbering format that occupies two adjacent storage locations in computer memory. A double precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point....
    , called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).


The other basic formats are "quad" (128-bit) binary and decimal floating-point, and "double" (64-bit) decimal floating-point.

Less common formats include:
  • Extended precision
    Extended precision

    The term "extended precision" refers to storage formats for floating point numbers taking advantage of an opportunity not falling in to a regular sequence of single, double and quadruple precision such as 32-bit, 64-bit and 128-bit occupying two, four or eight 16-bit storage words or similar accountancy....
     format, 80-bit floating point value. Usually "long double" in C language family.
  • Half
    Half precision

    In computing, half precision is a computer numbering format that occupies only half of one storage location in computer memory at some address....
    , also called float16, a 16-bit floating point value.


Any integer less than or equal to 224 can be exactly represented in the single precision format, and any integer less than or equal to 253 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.

The bit representations of IEEE binary floating-point numbers are monotonic (increasing or decreasing in accordance with the numbers they represent), as long as exceptional values are avoided and the signs are handled properly. IEEE binary floating-point numbers are equal if and only if their integer bit representations are equal. Binary floating-point comparisons can therefore be done with simple integer comparisons on the bit patterns, as long as the signs match. However, the actual floating-point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.

To a rough approximation, the bit representation of an IEEE binary floating-point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.

Although the 32 bit ("single") and 64 bit ("double") formats are by far the most common, the standard actually allows for many different precision levels. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision
Extended precision

The term "extended precision" refers to storage formats for floating point numbers taking advantage of an opportunity not falling in to a regular sequence of single, double and quadruple precision such as 32-bit, 64-bit and 128-bit occupying two, four or eight 16-bit storage words or similar accountancy....
 format, with a 15 bit exponent, a 64 bit significand, and no hidden bit.

There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C
C (programming language)

C is a general-purpose computer programming language originally developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating system....
 and related programming languages usually provide these formats via the long double
Long double

In C programming language and related programming languages, long double refers to a floating point data type that is often more precise than double precision....
 type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.

A project for revising the IEEE 754 standard was started in 2000 (see IEEE 754 revision
IEEE 754 revision

This article describes the revision process of the IEEE 754 standard, 2000-2008, and the changes included in the revision. For a description of the standard itself, see IEEE 754-2008....
); it was completed and approved in June 2008. It includes decimal floating-point formats and a 16 bit floating point format ("binary16"). binary16 has the same structure and rules as the older formats, with 1 sign bit, 5 exponent bits and 10 trailing significand bits. It is being used in the NVIDIA Cg graphics language, and in the openEXR standard.

Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:

TypeSignExponentExponent biassignificandtotal
Half
Half precision

In computing, half precision is a computer numbering format that occupies only half of one storage location in computer memory at some address....
 (IEEE 754r
IEEE 754r

The IEEE Standard for Floating-Point Arithmetic is the most widely-used standard for floating point computation, and is followed by many hardware and software implementations....
)
15151016
Single181272332
Double11110235264
Quad11516383112128


While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s and all 1s in this field are reserved for special treatment (see dealing with exceptional cases). Therefore the legal exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad.

As described earlier, when a binary number is normalized the leftmost bit of the significand is known to be 1. In the IEEE binary interchange formats that bit is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has a significand with 24 bits of precision, double precision format has 53, and quad has 113.

For example, it was shown above that p, rounded to 24 bits of precision, has:
  • sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as
  • 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB in hexadecimal
    Hexadecimal

    In mathematics and computer science, hexadecimal is a numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 09 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen....


Alternative computer representations for non-integral numbers

Floating-point representation, in particular the standard IEEE format, is by far the most common way of representing arbitrary real numbers in computers because it is efficiently handled in most large computer processors. However, there are alternatives:
  • Fixed-point
    Fixed-point arithmetic

    In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point . Fixed-point number representation can be compared to the more complicated floating point number representation....
     representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating-point and is also commonly used to perform integer operations. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
  • Binary-coded decimal
    Binary-coded decimal

    In computing and electronics systems, binary-coded decimal is an encoding for decimal numbers in which each digit is represented by its own binary sequence....
  • Where greater precision is desired, floating-point arithmetic can be emulated in software with variable-sized significands which might grow and shrink as the program runs. This is called arbitrary-precision
    Arbitrary-precision arithmetic

    In computer science, arbitrary-precision arithmetic, also called bignum arithmetic, is a technique whereby computer programs perform calculations on integers or rational numbers with an arbitrary number of numerical digits of precision , typically limited only by the available memory of the host system....
    , or "scaled bignum", arithmetic.
  • Some numbers (e.g., 1/3 and 0.1) cannot be represented exactly in binary floating-point no matter what the precision. Software packages that perform rational arithmetic
    Fraction (mathematics)

    A fraction is a number that can represent part of a whole.The earliest fractions were reciprocals of integers, symbols representing one half, one third, one quarter, and so on....
     represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
  • Computer algebra system
    Computer algebra system

    A computer algebra system is a Application software that facilitates symbolic mathematics. The core functionality of a CAS is manipulation of mathematical expressions in symbolic form....
    s such as Mathematica
    Mathematica

    Mathematica is a computational software program used widely in scientific, engineering, and mathematical fields and other areas of technical computing....
    , Maxima and Maple can often handle irrational numbers like or in a completely "formal" way, without dealing with a specific encoding of the significand. Such programs can evaluate expressions like "" exactly, because they "know" the underlying mathematics.
  • A representation based on natural logarithm
    Natural logarithm

    The natural logarithm, formerly known as the hyperbolic logarithm, is the logarithm to the base e , where e is an irrational number constant approximately equal to 2.718281828....
    s is sometimes used in FPGA-based applications where most arithmetic operations are multiplication or division. Like floating-point representation, this solution has precision for smaller numbers, as well as a wide range.


Representable numbers, conversion and rounding


By their nature, all numbers expressed in floating-point format are rational number
Rational number

In mathematics, a rational number is a number which can be expressed as a quotient of two integers. Non-integer rational numbers are usually written as the vulgar fraction , where b is not 0 ....
s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as p
P

P is the sixteenth letter of the modern Latin alphabet. Its name in English language is pronounced pee ....
 or v2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the number 123456789 clearly cannot be exactly represented if only eight decimal digits of precision are available.

When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value to the original, and the value thus adjusted is called the rounded value.

Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
e = −??; s = 1100110011001100110011001100110011...,
where, as previously, s is the significand and e is the exponent.

When rounded to 24 bits this becomes
e = −27; s = 110011001100110011001101,
which is actually 0.100000001490116119384765625 in decimal.

As a further example, the real number p
P

P is the sixteenth letter of the modern Latin alphabet. Its name in English language is pronounced pee ....
, represented in binary as an infinite series of bits is
11.0010010000111111011010101000100010000101101000110000100011010011...
but is
11.0010010000111111011011
when approximated by rounding
Rounding

Rounding involves reducing the number of significant digits in a number. The result of rounding is a "shorter" number having fewer non-zero digits yet similar in magnitude....
 to a precision of 24 bits.

In binary single-precision floating-point, this is represented as s = 110010010000111111011011 with e = −22. This has a decimal value of
3.1415927410125732421875,
whereas the more accurate approximation of the true value of p is
3.1415926535897932384626433832795...


The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of p in the first 7 digits. The difference is the discretization error
Discretization error

In numerical analysis, computational physics, and simulation, discretization error is error resulting from the fact that a function of a continuum variable is represented in the computer by a finite number of evaluations, for example, on a lattice ....
 and is limited by the machine epsilon
Machine epsilon

In floating point arithmetic, the machine epsilon is, for a particular floating point unit, the difference between 1 and the smallest exactly representable number greater than one....
.

The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called an Unit in the Last Place
Unit in the Last Place

In computer science, Unit in the Last Place or Unit of Least Precision is the gap between two very close floating-point numbers. To be exact, ulp is the gap between the two floating-point numbers closest to the value x....
 (ULP). For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. For numbers with an exponent of 0, an ULP is exactly 2-23 or about 10-7 in single precision, and about 10-16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP.

Rounding modes


Rounding modes are used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more significant digits than there are digits in the significand. There are several different rounding schemes
Rounding

Rounding involves reducing the number of significant digits in a number. The result of rounding is a "shorter" number having fewer non-zero digits yet similar in magnitude....
 (or rounding modes). Historically, truncation
Truncation

In mathematics, truncation is the term for limiting the number of numerical digits right of the decimal point, by discarding the least significant ones....
 was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even
Rounding

Rounding involves reducing the number of significant digits in a number. The result of rounding is a "shorter" number having fewer non-zero digits yet similar in magnitude....
, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In the case of a tie, the value that would make the significand end in a 0 bit is chosen. This IEEE standard applies to all fundamental algebraic operations, including square root, in the absence of exceptional conditions. It means that IEEE-compliant hardware behavior is completely determined in all 32 or 64 bits. ("Library" functions such as cosine and log are not mandated.)

Alternative rounding options are also available. IEEE-754-compliant hardware offers the following rounding modes:
  • round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
  • round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
  • round up (toward +8; negative results round toward zero)
  • round down (toward -8; negative results round away from zero)
  • round toward zero (sometimes called "chop" mode; it is similar to the common behavior of float-to-integer conversions, which convert -3.9 to -3)


Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and interval arithmetic
Interval arithmetic

Interval arithmetic, interval mathematics, interval analysis, or interval computation, is a method developed by mathematicians since the 1950s and 1960s as an approach to putting bounds on rounding errors in numerical analysis and thus developing numerical methods that yield very reliable results....
.

A further use of rounding modes is when a number is explicitly rounded to a certain number of decimal (or binary) places, as when rounding a result to euros and cents (two decimal places). In this case a common rounding mode is again "round to nearest, ties away from zero", in which a tie is rounded up for positive values.

Floating-point arithmetic operations


For ease of presentation and understanding, decimal radix
Radix

In numeral system, the base or radix is usually the number of unique Numerical digit, including zero, that a Positional notation numeral system uses to represent numbers....
 with 7 digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix
Radix

In numeral system, the base or radix is usually the number of unique Numerical digit, including zero, that a Positional notation numeral system uses to represent numbers....
 or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.

Addition and subtraction

A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and we then proceed with the usual addition method:

123456.7 = 1.234567 * 10^5 101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5

Hence: 123456.7 + 101.7654 = (1.234567 * 10^5) + (1.017654 * 10^2) = (1.234567 * 10^5) + (0.001017654 * 10^5) = (1.234567 + 0.001017654) * 10^5 = 1.235584654 * 10^5

In detail:

e=5; s=1.234567 (123456.7) + e=2; s=1.017654 (101.7654)

e=5; s=1.234567 + e=5; s=0.001017654 (after shifting) -------------------- e=5; s=1.235584654 (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is e=5; s=1.235585 (final sum: 123558.5)

Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error
Round-off error

A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value....
. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

e=5; s=1.234567 + e=-3; s=9.876543

e=5; s=1.234567 + e=5; s=0.00000009876543 (after shifting) ---------------------- e=5; s=1.23456709876543 (true sum) e=5; s=1.234567 (after rounding/normalization)

Another problem of loss of significance occurs when two close numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are representations of the rationals 123457.1467 and 123456.659.

e=5; s=1.234571 - e=5; s=1.234567 ---------------- e=5; s=0.000004 e=-1; s=4.000000 (after rounding/normalization)

The best representation of this difference is e = −1; s = 4.877000, which differs more than 20% from e = −1; s = 4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation
Loss of significance

Loss of significance is an undesirable effect in calculations using floating point arithmetic. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error, for example in subtracting two large and nearly equal numbers....
 illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis
Numerical analysis

Numerical analysis is the study of algorithms for the problems of continuous mathematics .One of the earliest mathematical writings is the Babylonian tablet YBC 7289, which gives a sexagesimal numerical approximation of , the length of the diagonal in a unit square....
; see also Accuracy problems.

Multiplication

To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

e=3; s=4.734612 × e=5; s=5.417242 ----------------------- e=8; s=25.648538980104 (true product) e=8; s=25.64854(after rounding) e=9; s=2.564854(after normalization)

Division is done similarly, but is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm
Booth's multiplication algorithm

Booth's multiplication algorithm is a multiplication algorithm that multiplies two signed base 2 numbers in two's complement....
 and digital division
Division (digital)

Several algorithms exist to perform division in digital designs. These algorithms fall into two main categories: slow division and fast division....
). For a fast, simple method, see the Horner method
Horner scheme

In numerical analysis, the Horner scheme or Horner algorithm, named after William George Horner, is an algorithm for the efficient evaluation of polynomials in Monomial basis....
.

Dealing with exceptional cases

Floating-point computation in a computer can run into three kinds of problems:
  • An operation can be mathematically illegal, such as division by zero.
  • An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex number
    Complex number

    In mathematics, the complex numbers are an extension of the real numbers obtained by adjoining an imaginary unit, denoted i, which satisfies:...
    s).
  • An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow
    Arithmetic overflow

    The term arithmetic overflow or simply overflow has the following meanings.# In a digital computer, the condition that occurs when a calculation produces a result that is greater in magnitude than what a given processor register or Computer storage location can store or represent....
     (exponent too large) or underflow
    Arithmetic underflow

    Arithmetic underflow is a condition that can occur when the result of a floating-point operation would be smaller in magnitude than the smallest quantity representable....
     (exponent too small).
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap
Trap (computing)

In computing and operating systems, a trap is a type of synchronization interrupt typically caused by an exception handling condition in a user process ....
 that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable
Porting

In computer science, porting is the process of adapting software so that an executable Computer program can be created for a computing environment that is different from the one for which it was originally designed ....
. Modern IEEE-compliant systems have a uniform way of handling these situations. An important part of the mechanism involves error values that result from a failing computation, and that can propagate silently through subsequent computation until they are detected at a point of the programmer's choosing.

The two error values are "infinity" (often denoted "INF"), and "NaN
NaN

In computing, NaN, which stands for Not a Number, is a value or symbol that is usually produced as the result of an operation on invalid input operands, especially in floating point calculations....
" ("not a number"), which covers all other errors. "Infinity" does not necessarily mean that the result is actually infinite. It simply means "too large to represent".

Both of these are encoded with the exponent field set to all 1s. (Recall that exponent fields of all 0s or all 1s are reserved for special meanings.) The significand field is set to something that can distinguish them—typically zero for INF and nonzero for NaN. The sign bit is meaningful for INF, that is, floating-point hardware distinguishes between + and -.

When a nonzero number is divided by zero (the divisor must be exactly zero), a "zerodivide" event occurs, and the result is set to infinity of the appropriate sign. In other cases in which the result's exponent is too large to represent, such as division of an extremely large number by an extremely small number, an "overflow" event occurs, also producing infinity of the appropriate sign. This is different from a zerodivide, though both produce a result of infinity, and the distinction is usually unimportant in practice.

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as
  • (+INF) + (+7) = (+INF)
  • (+INF) × (−2) = (−INF)
  • But: (+INF) × 0 = NaN—there is no meaningful thing to do


When the result of an operation has an exponent too small to represent properly, an "underflow" event occurs. The hardware responds to this by changing to a format in which the significand is not normalized, and there is no "hidden" bit—that is, all bits of the significand are represented. The exponent field is set to the reserved value of zero. The significand is set to whatever it has to be in order to be consistent with the exponent. Such a number is said to be "subnormal
Denormal number

In computer science, denormal numbers or denormalized numbers fill the gap around 0 in floating point arithmetic: any non-zero number which is smaller than the smallest normal number is 'sub-normal'....
" (sometimes called a "denormalized number", or a "denorm" for short, referring to the older term for subnormals that was used in the first first IEEE 754 standard where subnormal numbers were always unnormalized). Subnormal numbers are valid operands to arithmetic operations.

If no significant bits are able to appear in the significand field, the number is zero. Note that, in this case, the exponent field and significand field are all zeros—floating-point zero is represented by all zeros.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

Other errors, such as division of zero by zero, or taking the square root of −1, cause an "operand error" (known in IEEE 754 as an "invalid operation") event, and produce a NaN result. NaNs propagate aggressively through arithmetic operations—any NaN operand to almost any operation produces a NaN result.

In summary, there are five special "events" that may occur, though some of them are quite benign:
  • An overflow occurs as described previously, producing an infinity
  • An underflow occurs as described previously, producing a subnormal number or zero
  • A zerodivide occurs as described previously, producing an infinity of the appropriate sign
  • An "operand error" ("invalid operation") occurs as described previously, producing a NaN
  • An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.


Computer hardware is typically able to raise exceptions
Exception handling

Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of exceptions - special conditions that change the normal flow of execution....
 when these events occur. How this is done is system-dependent. Usually these exceptions are all masked (disabled) by software, and programs rely on the propagation of error values.

Accuracy problems


The fact that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, leads to many surprising situations. This is related to the finite precision
Precision (computer science)

In computer science, precision of a numerical quantity is a measure of the detail in which the quantity is expressed. This is usually measured in bits, but sometimes in decimal digits....
 with which computers generally represent numbers.

For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is
0.100000001490116119384765625 exactly.
Squaring this number gives
0.010000000298023226097399174250313080847263336181640625 exactly.
Squaring it with single-precision floating-point hardware (with rounding) gives
0.010000000707805156707763671875 exactly.
But the representable number closest to 0.01 is
0.009999999776482582092285156250 exactly.


Also, the non-representability of p (and p/2) means that an attempted computation of tan(p/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(p/2), because p/2 cannot be represented exactly. This computation in C: // Enough digits to be sure we get the correct approximation. double pi = 3.1415926535897932384626433832795; double z = tan(pi/2.0); will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(p) will not yield zero. The result will be (approximately) 0.1225 in double precision, or −0.8742 in single precision.

While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit decimal arithmetic: 1234.567 + 45.67844 = 1280.245 1280.245 + 0.0004 = 1280.245 but 45.67840 + 0.00004 = 45.67844 45.67844 + 1234.567 = 1280.246 They are also not necessarily distributive. That is, (a + b) ×c may not be the same as a×c + b×c: 1234.567 × 3.333333 = 4115.223 1.234567 × 3.333333 = 4.115223 4115.223 + 4.115223 = 4119.338 but 1234.567 + 1.234567 = 1235.802 1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as p and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
  • Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
  • Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
  • Limited exponent range: results might overflow yielding infinity, or underflow yielding a denormal value or zero. If a denormal number
    Denormal number

    In computer science, denormal numbers or denormalized numbers fill the gap around 0 in floating point arithmetic: any non-zero number which is smaller than the smallest normal number is 'sub-normal'....
     results, precision will be lost.
  • Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
  • Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that doesn't necessarily make the problem go away.


Machine Precision

Machine Precision is a quantity that characterizes the accuracy of a floating point system. It is also known as unit roundoff or machine epsilon. Usually denoted Emach, its value depends on the particular rounding being used.

With rounding to zero,
Emach = B^(1-P)
whereas rounding to nearest,
Emach = (1/2)*B^(1-P)

This is important since it bounds the relative error in representing any non-zero real number x within the normalized range of a floating point system:
Emach

Minimizing the effect of accuracy problems

Because of the issues noted above, naive use of floating-point arithmetic can lead to many problems. The creation of thoroughly robust floating-point software is a complicated undertaking, and a good understanding of numerical analysis
Numerical analysis

Numerical analysis is the study of algorithms for the problems of continuous mathematics .One of the earliest mathematical writings is the Babylonian tablet YBC 7289, which gives a sexagesimal numerical approximation of , the length of the diagonal in a unit square....
 is essential.

In addition to careful design of programs, careful handling by the compiler
Compiler

A compiler is a computer program that transforms source code written in a programming language into another computer language . The most common reason for wanting to transform source code is to create an executable program....
 is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.

Floating-point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io
Io (moon)

'Io' is the innermost of the four Galilean moons natural satellite of Jupiter and, with a diameter of 3,642 Kilometre, the List of moons by diameter in the Solar System....
 or the mass of the proton
Proton

The proton is a subatomic particle with an electric charge of +1 elementary charge. It is found in the nucleus of each atom but is also stable by itself and has a second identity as the hydrogen ion, H+....
), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. The "decimal" data type of the C# programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating-point representation, and make the arithmetic always behave as expected when numbers are printed in decimal.

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.

Expectations from mathematics may not be realised in the field of floating-point computation. For example, it is known that , and that . These facts cannot be counted on when the quantities involved are the result of floating-point computation.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Descriptions of a few simple techniques follow.

The use of the equality test (if (x

y) ...) is usually not recommended when expectations are based on results from pure mathematics. Such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ...), where epsilon is sufficiently small and tailored to the application, such as 1.0E-13 - see machine epsilon
Machine epsilon

In floating point arithmetic, the machine epsilon is, for a particular floating point unit, the difference between 1 and the smallest exactly representable number greater than one....
). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.

An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like 3253.671 + 3.141276 -------- 3256.812 The low 3 digits of the addends are effectively lost. The Kahan summation algorithm
Kahan summation algorithm

In numerical analysis, the Kahan summation algorithm significantly reduces the numerical error in the total obtained by adding a sequence of finite decimal precision floating point numbers, compared to the obvious approach....
 may be used to reduce the errors.

Computations may be rearranged in a way that is mathematically equivalent but less prone to error. As an example, Archimedes
Archimedes

Archimedes of Syracuse was a Greek mathematics, physicist, engineer, inventor, and astronomer. Although few details of his life are known, he is regarded as one of the leading scientists in classical antiquity....
 approximated p by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

i 6 × 2i × ti, first form 6 × 2i × ti, second form

0 .4641016151377543863 .4641016151377543863 1 .2153903091734710173 .2153903091734723496 2 596599420974940120 596599420975006733 3 60862151314012979 60862151314352708 4 27145996453136334 27145996453689225 5 8730499801259536 8730499798241950 6 6627470548084133 6627470568494473 7 6101765997805905 6101766046906629 8 70343230776862 70343215275928 9 37488171150615 37487713536668 10 9278733740748 9273850979885 11 7256228504127 7220386148377 12 717412858693 707019992125 13 189011456060 78678454728 14 717412858693 46593073709 15 19358822321783 8571730119 16 717412858693 6566394222 17 810075796233302 6065061913 18 717412858693 939728836 19 4061547378810956 908393901 20 05434924008406305 900560168 21 00068646912273617 8608396 22 349453756585929919 8122118 23 00068646912273617 95552 24 .2245152435345525443 68907 25 62246 26 62246 27 62246 28 62246 The true value is

While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.

See also



External links

An edited reprint of the paper , by David Goldberg, published in the March, 1991 issue of Computing Surveys. Donald Knuth
Donald Knuth

Donald Ervin Knuth is a renowned computer science and Emeritus of the Art of Computer Programming at Stanford University.Author of the seminal multi-volume work The Art of Computer Programming , Knuth has been called the "father" of the run-time analysis, contributing to the development of, and systematizing formal mathematical techn...
. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89684-2. Section 4.2: Floating Point Arithmetic, pp.214–264. Press et al. Numerical Recipes
Numerical Recipes

Numerical Recipes is the generic title of an influential series of books on algorithms and numerical analysis, all by William Press, Saul Teukolsky, William Vetterling and Brian Flannery:...
 in C++
C++

C++ is a general-purpose programming language. It is regarded as a middle-level language, as it comprises a combination of both high-level programming language and low-level programming language language features....
. The Art of Scientific Computing,
ISBN 0-521-75033-4. Kahan, William and Darcy, Joseph (2001). How Java’s floating-point hurts everyone everywhere. Retrieved September 5, 2003 from . This page gives a very brief summary of floating-point formats that have been used over the years. , by David Monniaux, also printed in ACM
Association for Computing Machinery

The Association for Computing Machinery, or ACM, was founded in 1947 as the world's first scientific and educational computing society. Its membership was approximately 83,000 as of 2007....
 Transactions on programming languages and systems (TOPLAS)
, May 2008: a compendium of non-intuitive behaviours of floating-point on popular architectures, with implications for program verification and testing The www.opencores.org website contains open source floating point IP cores for the implementation of floating point operators in FPGA or ASIC devices. The project, double_fpu, contains verilog source code of a double precision floating point unit. The project, fpuvhdl, contains vhdl source code of a single precision floating point unit.