Floating-point numbers consist of an ``exponent,'' a ``significand,'' and a ``sign bit.'' For a negative number, we may set the sign bit of the floating-point word and negate the number to be encoded, leaving only nonnegative numbers to be considered. Zero is represented by all zeros, so now we need only consider positive numbers.

The basic idea of floating-point *encoding* of a binary number is to
*normalize* the number by *shifting* the bits either left or right
until the shifted result lies between 1/2 and 1. (A left-shift by one
place in a binary word corresponds to multiplying by 2, while a right-shift
by one place corresponds to dividing by 2.) The number of bit-positions
shifted to normalize the number can be recorded as a signed integer. The
negative of this integer (*i.e.*, the shift required to recover the original
number) is defined as the *exponent* of the floating-point encoding.
The normalized number between 1/2 and 1 is called the *significand*
because it holds all the ``significant bits'' of the number.
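The normalization step can be tried out directly in Python. The following is a minimal sketch, not a definitive encoder: it relies on the standard-library function `math.frexp`, which uses the same convention as the text (significand in the interval from 1/2 up to, but not including, 1). The wrapper name `normalize` is our own, chosen for illustration.

```python
import math

def normalize(x):
    """Split positive x into (significand, exponent), 1/2 <= significand < 1."""
    if x <= 0:
        raise ValueError("x must be positive; sign and zero are handled separately")
    # math.frexp returns (m, e) such that x == m * 2**e and 0.5 <= m < 1
    significand, exponent = math.frexp(x)
    return significand, exponent

print(normalize(6.0))       # (0.75, 3), since 6 = 0.75 * 2**3
print(normalize(0.15625))   # (0.625, -2), since 0.15625 = 0.625 * 2**-2
```

The inverse operation, recovering the number from its significand and exponent, is available as `math.ldexp(significand, exponent)`.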

Floating-point notation is exactly analogous to ``scientific notation'' for
decimal numbers, *e.g.*, $1.2345\times 10^{-9}$; the number of significant
digits, 5 in this example, is determined by counting digits in the
``significand'' $1.2345$, while the ``order of magnitude'' is determined by
the power of 10 ($-9$ in this case). In floating-point numbers, the
significand is stored in fractional two's-complement binary format, and the
exponent is stored as a binary integer.

Since the significand lies in the interval $[1/2, 1)$,^{G.6} its most
significant bit is always a 1, so it is not actually stored in the
computer word, giving one more significant bit of precision.

Let's now restate the above a little more precisely. Let $x>0$ denote a
number to be encoded in floating-point, and let
$\tilde{x} = x\cdot 2^{-E}$ denote the normalized value obtained by
shifting $x$ either $E$ bits to the right (if $E>0$), or $|E|$ bits to the
left (if $E<0$). Then we have $1/2 \leq \tilde{x} < 1$, and
$x = \tilde{x}\cdot 2^{E}$. The *significand* $M$ of the floating-point
representation for $x$ is defined as the binary encoding of
$\tilde{x}$.^{G.7} It is often the case that $\tilde{x}$ requires more bits
than are available for exact encoding. Therefore, the significand is
typically *rounded* (or truncated) to the value closest to $\tilde{x}$.
Given $N_M$ bits for the significand, the encoding of $\tilde{x}$ can be
computed by multiplying it by $2^{N_M}$ (left-shifting it $N_M$ bits),
*rounding* to the nearest integer (or *truncating* toward minus infinity,
as implemented by the `floor()` function), and encoding the $N_M$-bit
result as a binary (signed) integer.
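The quantization just described (scale the normalized value by a power of two, then round or truncate to an integer) can be sketched as follows. The function name `encode_significand` and its parameters `x_norm`, `n_bits`, and `truncate` are hypothetical names introduced only for this illustration, and rounding ties are resolved upward here as a simplifying assumption.

```python
import math

def encode_significand(x_norm, n_bits, truncate=False):
    """Quantize a normalized value in [1/2, 1) to an integer using n_bits bits."""
    if not 0.5 <= x_norm < 1.0:
        raise ValueError("expected a normalized value in [1/2, 1)")
    scaled = x_norm * 2**n_bits        # same as left-shifting by n_bits
    if truncate:
        return math.floor(scaled)      # truncate toward minus infinity
    return math.floor(scaled + 0.5)    # round to nearest integer (ties go up)

x_norm, exponent = math.frexp(0.1)     # 0.1 = 0.8 * 2**-3 (approximately)
print(encode_significand(x_norm, 8))                  # 205
print(encode_significand(x_norm, 8, truncate=True))   # 204
```

Note that rounding a value just below 1 can produce $2^{n}$ for an $n$-bit significand, which no longer fits in $n$ bits; a complete encoder would then renormalize by halving the significand and incrementing the exponent. That step is omitted from this sketch.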

As a final practical note, exponents in floating-point formats may have a
*bias*. That is, instead of storing $E$ as a binary integer, you may
find a binary encoding of $E-B$ where $B$ is the bias.^{G.8}

These days, floating-point formats generally follow the IEEE standards set out for them. A single-precision floating-point word is $32$ bits (four bytes) long, consisting of $1$ sign bit, $8$ exponent bits, and $23$ significand bits, normally laid out as

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

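where S denotes the sign bit, E an exponent bit, and M a significand bit. Note that in this layout the exponent bits are more significant than the significand bits, so two nonnegative floating-point words can be compared in hardware using ordinary integer comparison.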
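To see this layout on an actual machine word, the following sketch uses Python's standard `struct` module to reinterpret a single-precision value as a 32-bit unsigned integer and mask out the three fields. The helper names `float32_word` and `float32_fields` are our own, for illustration only.

```python
import struct

def float32_word(x):
    """Return the 32-bit IEEE single-precision word for x as an unsigned integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

def float32_fields(x):
    """Return (sign bit, 8-bit exponent field, 23-bit significand field)."""
    word = float32_word(x)
    sign = word >> 31                   # S: top bit
    exponent_field = (word >> 23) & 0xFF  # E: next 8 bits (stored with a bias)
    significand_field = word & 0x7FFFFF   # M: low 23 bits (hidden bit not stored)
    return sign, exponent_field, significand_field

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-6.0))   # (1, 129, 4194304)

# For nonnegative values, integer order of the words matches numeric order.
print(float32_word(0.5) < float32_word(1.0) < float32_word(6.0))   # True
```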

A double-precision floating-point word is $64$ bits (eight bytes) long,
consisting of $1$ sign bit, $11$ exponent bits, and $52$ significand bits.
In the Intel Pentium processor, there is also an *extended precision*
format, used for intermediate results, which is $80$ bits (ten bytes)
containing $1$ sign bit, $15$ exponent bits, and $64$ significand bits. In
Intel processors, the exponent bias is $-127$ for single-precision
floating-point, $-1023$ for double-precision, and $-16383$ for
extended-precision. The single and double precision formats have a
``hidden'' significand bit, while the extended precision format does not.
Thus, the most significant significand bit is always set in extended
precision.
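The roles of the exponent bias and the hidden bit can be checked by taking a double-precision word apart and rebuilding its value by hand. The sketch below is only illustrative: the function names are hypothetical, only normal numbers are handled (not zero, denormals, infinities, or NaNs), and it follows the usual IEEE convention in which the significand with its hidden bit restored lies in $[1,2)$ rather than $[1/2,1)$, so the exponent recovered here is one less than it would be under the normalization used in the text above.

```python
import struct

def decode_double(x):
    """Return (sign bit, 11-bit exponent field, 52-bit significand field)."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = word >> 63
    exponent_field = (word >> 52) & 0x7FF
    significand_field = word & ((1 << 52) - 1)
    return sign, exponent_field, significand_field

def rebuild_double(sign, exponent_field, significand_field):
    """Rebuild the value of a normal double from its three fields."""
    # Restore the hidden leading 1; the result lies in [1, 2).
    significand = 1 + significand_field / 2**52
    # Remove the bias of 1023 from the stored exponent field.
    value = significand * 2.0 ** (exponent_field - 1023)
    return -value if sign else value

print(decode_double(1.0))                          # (0, 1023, 0)
print(rebuild_double(*decode_double(0.1)) == 0.1)  # True
```

Python floats are double-precision on all common platforms, which is why the round trip is exact here. The 80-bit extended format has no portable counterpart in the Python standard library, so it is not shown.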

The MPEG-4 audio compression standard (which supports compression using music synthesis algorithms) specifies that the numerical calculations in any MPEG-4 audio decoder should be at least as accurate as 32-bit single-precision floating point.

Copyright ©

Center for Computer Research in Music and Acoustics (CCRMA), Stanford University