http://people.csail.mit.edu/jaffer/III/EZFPRW | |
Easy Accurate Reading and Writing of Floating-Point Numbers |
Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.
In both How to Read Floating-Point Numbers Accurately[1990] and Printing floating-point numbers quickly and accurately[1996] the key issue is that successive rounding operations do not have the same effect as a single rounding operation.
The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE-754 double-precision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.
Both reading and writing of floating-point numbers can involve
division of numbers larger than can be stored in the
floating-point registers, causing rounding at undesired points of
the conversion.
Bignums (arbitrary precision integers) can perform division of large
integers without rounding. What is needed is a bignum
division-with-rounding operator, called round-quotient
here. It can be implementated in Scheme as
follows:
(define (round-quotient num den) (define quo (quotient num den)) (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den)) (+ quo (if (eqv? (negative? num) (negative? den)) 1 -1)) quo))
If the remainder is more than half of the denominator, then it
rounds up; if it is less, then it rounds down; if it is equal, then
it rounds to even. These are the same rounding rules as
IEEE Standard for Binary Floating-Point Arithmetic
(754-1985) and the
Scheme
procedure round
.
The algorithms presented here use the integer-length function from Common-Lisp and the SLIB Scheme Library:
Returns the number of bits neccessary to represent n.
Example:
(integer-length #b10101010) ⇒ 8 (integer-length 0) ⇒ 0 (integer-length #b1111) ⇒ 4
For positive argument n, integer-length
returns
⌈log_{2} n⌉.
The algorithms also use ldexp
(from C). Here is Scheme
code for ldexp
:
;; ldexp() from C. (define (ldexp bmant bexp) (define DBL_MIN_10_EXP -1074) (if (> DBL_MIN_10_EXP (* -2 (abs bexp))) (* (* 1. bmant (expt 2 (quotient bexp 2))) (expt 2 (- bexp (quotient bexp 2)))) (* 1. bmant (expt 2 bexp))))
The conditional in ldexp
is in order to work
with bexp values where 2^{bexp} is out
of floating-point range.
The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floating-point number is straightforward.
;; dbl-mant-dig is the number of bits in the mantissa field of the ;; floating-point format. (define dbl-mant-dig 53) ;; Given positive integer mantissa MANT and exponent (of 10) POINT, ;; find the closest binary floating-point number. ;; bex is the binary shift for quo=mant/scl; bex is negative. (define (mantexp->dbl mant point) (if (>= point 0) (exact->inexact (* mant (expt 10 point))) (let* ((scl (expt 10 (- point))) (bex (- (integer-length mant) (integer-length scl) dbl-mant-dig)) (num (* mant (expt 2 (- bex)))) (quo (round-quotient num scl))) (cond ((> (integer-length quo) dbl-mant-dig) ;too many bits of quotient (set! bex (+ 1 bex)) (set! quo (round-quotient num (* scl 2))))) (ldexp (exact->inexact quo) bex))))
When point is non-negative, the mantissa is multiplied by
10^{point} and the product is rounded to fit
in dbl-mant-dig
bits by exact->inexact
.
With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl=10^{−point}. Over the floating-point range, the longest a rounded quotient of a n bit number and a m bit power-of-10 can be is 1+n−m bits; the shortest is n−m bits.
2^{−bex} is the power-of-two multiplier.
The initial value of bex corresponds to
the n−m case above. If the round-quotient is more
than dbl-mant-dig
bits long, then
call round-quotient
with double the denominator,
2⋅scl. In either case, the final step is to convert
to floating-point using ldexp
.
The algorithm for writing a floating-point number is more complicated because it must generate the shortest decimal mantissa which reads as the original floating-point input.
The approximate number of decimal digits needed to represent a
binary number is obtained by multiplying dbl-mant-dig
by log_{2}10. Try to convert using the floor of this number
and test to see whether it would read as the original number. If
so, then return; if not, then try with one more decimal digit.
(define (exact-ceiling x) (inexact->exact (ceiling x))) ;; Given positive integer mantissa and exponent (of 2) E2, find the ;; closest integer and exponent (of 10). (define (dbl->string mant e2) (define llog2 (/ (log 2) (log 10))) (define f (ldexp mant e2)) (define quo 0) (define point 0) (if (> e2 0) (let ((num (* mant (expt 2 e2)))) (set! point (max 0 (exact-ceiling (- (/ (log num) (log 10)) (* dbl-mant-dig llog2))))) (let ((den (expt 10 point))) (set! quo (round-quotient num den)) (cond ((not (= (mantexp->dbl quo point) f)) (set! point (+ -1 point)) (set! quo (round-quotient num (quotient den 10))))))) (let ((den (expt 2 (- e2)))) (set! point (exact-ceiling (* e2 llog2))) (let ((num (* mant (expt 10 (- point))))) (set! quo (round-quotient num den)) (cond ((and (positive? f) (not (= (mantexp->dbl quo point) f))) (set! point (+ -1 point)) (set! quo (round-quotient (* num 10) den))))))) (let* ((dman (number->string quo)) (lman (string-length dman))) (do ((idx (+ -1 lman) (+ -1 idx))) ((or (zero? idx) (not (eqv? #\0 (string-ref dman idx)))) (string-append "." (substring dman 0 (+ 1 idx)) "e" (number->string (+ point lman)))))))
When e2 is positive, num is bound to the
product of mant and
2^{e2}. point is set to an
upper-bound of the number of decimal digits of num in
excess of the floating-point mantissa's precision.
The round-quotient
of num and
10^{point} produces the integer quo.
If mantexp->dbl
of quo
and point is not equal to the original floating-point
value f, then the round-quotient
is computed
again with the divisor divided by 10 yielding one more digit of
precision.
When e2 is negative, den is bound to
2^{−e2} and point is set to the
negation of an upper-bound of the number of decimal digits in
2^{e2}.
num is bound to the product of mant and
10^{point}. The round-quotient
of num and den produces the
integer quo. If mantexp->dbl
of quo and point is not equal to the original
floating-point value f, then
the round-quotient
is computed again
with num multiplied by 10 yielding one more digit of
precision.
The last part of dbl->string
produces a string using
Scheme's number->string
to convert the integer
mantissa and exponent components to decimal strings. Mantissa
trailing zeros are eliminated by scanning the dman string
in reverse for non-zero digits.
I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.
My actions and comments do not reflect in any way on MIT. | ||
CS and Numerical Analysis | ||
agj @ alum.mit.edu | Go Figure! |