
Floating-Point Numbers

Floating-point numbers consist of an ``exponent,'' ``significand,'' and ``sign bit.'' For a negative number, we may set the sign bit of the floating-point word and negate the number to be encoded, leaving only nonnegative numbers to be considered. Zero is represented by all zeros, so now we need only consider positive numbers.

The basic idea of floating-point encoding of a binary number is to normalize the number by shifting the bits either left or right until the shifted result lies between 1/2 and 1. (A left-shift by one place in a binary word corresponds to multiplying by 2, while a right-shift by one place corresponds to dividing by 2.) The number of bit-positions shifted to normalize the number can be recorded as a signed integer. The negative of this integer (i.e., the shift required to recover the original number) is defined as the exponent of the floating-point encoding. The normalized number between 1/2 and 1 is called the significand because it holds all the ``significant bits'' of the number.

Floating-point notation is exactly analogous to ``scientific notation'' for decimal numbers, e.g., $1.2345\times 10^{-9}$; the number of significant digits, 5 in this example, is determined by counting digits in the ``significand'' $1.2345$, while the ``order of magnitude'' is determined by the power of 10 ($-9$ in this case). In floating-point numbers, the significand is stored in fractional two's-complement binary format, and the exponent is stored as a binary integer.

Since the significand lies in the interval $[1/2,1)$, its most significant bit is always a 1, so it is not actually stored in the computer word, giving one more significant bit of precision.

Let's now restate the above a little more precisely. Let $x>0$ denote a number to be encoded in floating point, and let $\tilde{x}= x\cdot 2^{-E}$ denote the normalized value obtained by shifting $x$ either $E$ bits to the right (if $E>0$), or $\left\vert E\right\vert$ bits to the left (if $E<0$). Then we have $1/2 \leq \tilde{x}< 1$, and $x = \tilde{x}\cdot 2^E$. The significand $M$ of the floating-point representation for $x$ is defined as the binary encoding of $\tilde{x}$. It is often the case that $\tilde{x}$ requires more bits than are available for exact encoding. Therefore, the significand is typically rounded (or truncated) to the value closest to $\tilde{x}$. Given $N_M$ bits for the significand, the encoding of $\tilde{x}$ can be computed by multiplying it by $2^{N_M}$ (left-shifting it $N_M$ bits), rounding to the nearest integer (or truncating toward minus infinity, as implemented by the floor() function), and encoding the $N_M$-bit result as a binary (signed) integer.
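The normalization and quantization steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `encode_significand` and `decode_significand` are hypothetical, and the standard-library call `math.frexp` is used because it happens to return a normalized value in $[1/2,1)$ together with the exponent $E$. Note that Python's built-in `round` resolves exact ties to the nearest even integer.

```python
import math

def encode_significand(x, n_m, rounding=True):
    """Return (M, E) such that x is approximately M * 2**(E - n_m), for x > 0.

    M is the integer obtained by scaling the normalized value in [1/2, 1)
    by 2**n_m and then rounding (or truncating with floor).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    x_tilde, e = math.frexp(x)        # x == x_tilde * 2**e, 1/2 <= x_tilde < 1
    scaled = x_tilde * 2**n_m         # left-shift by n_m bits
    m = round(scaled) if rounding else math.floor(scaled)
    if m == 2**n_m:                   # rounding carried up to 1.0: renormalize
        m //= 2
        e += 1
    return m, e

def decode_significand(m, e, n_m):
    """Recover the (quantized) value M * 2**(E - n_m)."""
    return math.ldexp(m, e - n_m)

m, e = encode_significand(6.5, 8)     # 6.5 = 0.8125 * 2**3
print(m, e, decode_significand(m, e, 8))
```

With $N_M = 8$, the value $6.5$ is encoded exactly, while $0.1$ is not: rounding and truncation give different integers for it, and neither decodes back to exactly $0.1$.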

As a final practical note, exponents in floating-point formats may have a bias. That is, instead of storing $E$ as a signed binary integer, you may find a binary encoding of $E+B$, where $B$ is the bias, chosen so that the stored value is never negative.

These days, floating-point formats generally follow the IEEE 754 standard. A single-precision floating-point word is $32$ bits (four bytes) long, consisting of $1$ sign bit, $8$ exponent bits, and $23$ significand bits, normally laid out as

    S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

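where S denotes the sign bit, E an exponent bit, and M a significand bit. Note that because the exponent bits are more significant than the significand bits in this layout, the hardware can compare the magnitudes of two floating-point numbers using ordinary integer comparison on the bits below the sign bit.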
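As a rough illustration of this layout, the following Python sketch splits a single-precision word into its three fields using the standard `struct` module. The helper names `float32_bits` and `float32_fields` are hypothetical, and the input is first rounded to single precision by `struct.pack`.

```python
import struct

def float32_bits(x):
    """Return the 32-bit pattern of x (rounded to single precision) as an int."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def float32_fields(x):
    """Split the single-precision word for x into (S, E field, M field)."""
    bits = float32_bits(x)
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits, stored with bias
    significand = bits & 0x7FFFFF     # 23 stored significand bits
    return sign, exponent, significand

print(float32_fields(1.0))            # (0, 127, 0)
print(float32_fields(-6.5))

# For nonnegative numbers, ordering by bit pattern matches ordering by value.
values = [2.0, 0.0, 1e10, 0.5, 1.5, 1e-3, 1.0]
print(sorted(values, key=float32_bits) == sorted(values))
```

The last line checks the integer-comparison remark on a handful of nonnegative values; it is a spot check, not a proof.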

A double-precision floating-point word is $64$ bits (eight bytes) long, consisting of $1$ sign bit, $11$ exponent bits, and $52$ significand bits. In the Intel Pentium processor, there is also an extended-precision format, used for intermediate results, which is $80$ bits (ten bytes) containing $1$ sign bit, $15$ exponent bits, and $64$ significand bits. In Intel processors, the exponent bias is $127$ for single-precision floating point, $1023$ for double precision, and $16383$ for extended precision. These biases assume the IEEE convention of normalizing the significand to $[1,2)$ rather than $[1/2,1)$, so the exponent they apply to is one less than $E$ as defined above. The single- and double-precision formats have a ``hidden'' significand bit, while the extended-precision format does not. Thus, for normalized numbers, the most significant significand bit is explicitly stored and always set in extended precision.
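To see the bias and the hidden bit at work, here is a minimal Python sketch that rebuilds a double-precision value from its fields. It assumes the IEEE convention in which the significand, with the hidden bit restored, lies in $[1,2)$ and the stored exponent equals the true exponent plus $1023$. The name `decode_double` is hypothetical, and only normalized finite numbers are handled (zero, denormals, infinities, and NaNs are rejected).

```python
import math
import struct

def decode_double(x):
    """Rebuild x from the sign, exponent, and significand fields of its 64-bit word."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # 1 sign bit
    stored_exp = (bits >> 52) & 0x7FF       # 11 exponent bits, bias 1023
    fraction = bits & ((1 << 52) - 1)       # 52 stored significand bits
    if stored_exp == 0 or stored_exp == 0x7FF:
        raise ValueError("only normalized finite numbers are handled")
    significand = (1 << 52) | fraction      # restore the hidden leading 1
    # significand / 2**52 lies in [1, 2); remove the bias from the exponent
    value = math.ldexp(significand, stored_exp - 1023 - 52)
    return -value if sign else value

print(decode_double(6.5))                   # 6.5
print(decode_double(-0.1) == -0.1)
```

Because the reconstruction uses only exact integer and power-of-two operations, the decoded value should match the input bit for bit.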

The MPEG-4 audio compression standard (which supports compression using music synthesis algorithms) specifies that the numerical calculations in any MPEG-4 audio decoder should be at least as accurate as 32-bit single-precision floating point.



``Mathematics of the Discrete Fourier Transform (DFT), with Audio Applications --- Second Edition'', by Julius O. Smith III, W3K Publishing, 2007, ISBN 978-0-9745607-4-8
Copyright © 2024-04-02 by Julius O. Smith III
Center for Computer Research in Music and Acoustics (CCRMA),   Stanford University