Easy Accurate Reading and Writing of Floating-Point Numbers
Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.
In both How to Read Floating-Point Numbers Accurately and Printing floating-point numbers quickly and accurately the key issue is that successive rounding operations do not have the same effect as a single rounding operation.
The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE-754 double-precision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.
Both reading and writing of floating-point numbers can involve
division of numbers larger than can be stored in the
floating-point registers, causing rounding at undesired points of
Bignums (arbitrary precision integers) can perform division of large
integers without rounding. What is needed is a bignum
division-with-rounding operator, called
here. It can be implementated in Scheme as
(define (round-quotient num den) (define quo (quotient num den)) (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den)) (+ quo (if (eqv? (negative? num) (negative? den)) 1 -1)) quo))
If the remainder is more than half of the denominator, then it
rounds up; if it is less, then it rounds down; if it is equal, then
it rounds to even. These are the same rounding rules as
IEEE Standard for Binary Floating-Point Arithmetic
(754-1985) and the
The algorithms presented here use the integer-length function from Common-Lisp and the SLIB Scheme Library:
Returns the number of bits neccessary to represent n.
Example:(integer-length #b10101010) ⇒ 8 (integer-length 0) ⇒ 0 (integer-length #b1111) ⇒ 4
For positive argument n,
The algorithms also use
ldexp (from C). Here is Scheme
;; ldexp() from C. (define (ldexp bmant bexp) (define DBL_MIN_10_EXP -1074) (if (> DBL_MIN_10_EXP (* -2 (abs bexp))) (* (* 1. bmant (expt 2 (quotient bexp 2))) (expt 2 (- bexp (quotient bexp 2)))) (* 1. bmant (expt 2 bexp))))
The conditional in
ldexp is in order to work
with bexp values where 2bexp is out
of floating-point range.
The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floating-point number is straightforward.
;; dbl-mant-dig is the number of bits in the mantissa field of the ;; floating-point format. (define dbl-mant-dig 53) ;; Given positive integer mantissa MANT and exponent (of 10) POINT, ;; find the closest binary floating-point number. ;; bex is the binary shift for quo=mant/scl; bex is negative. (define (mantexp->dbl mant point) (if (>= point 0) (exact->inexact (* mant (expt 10 point))) (let* ((scl (expt 10 (- point))) (bex (- (integer-length mant) (integer-length scl) dbl-mant-dig)) (num (* mant (expt 2 (- bex)))) (quo (round-quotient num scl))) (cond ((> (integer-length quo) dbl-mant-dig) ;too many bits of quotient (set! bex (+ 1 bex)) (set! quo (round-quotient num (* scl 2))))) (ldexp (exact->inexact quo) bex))))
When point is non-negative, the mantissa is multiplied by
10point and the product is rounded to fit
dbl-mant-dig bits by
With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl=10−point. Over the floating-point range, the longest a rounded quotient of a n bit number and a m bit power-of-10 can be is 1+n−m bits; the shortest is n−m bits.
2−bex is the power-of-two multiplier.
The initial value of bex corresponds to
the n−m case above. If the round-quotient is more
dbl-mant-dig bits long, then
round-quotient with double the denominator,
2⋅scl. In either case, the final step is to convert
to floating-point using
The algorithm for writing a floating-point number is more complicated because it must generate the shortest decimal mantissa which reads as the original floating-point input.
The approximate number of decimal digits needed to represent a
binary number is obtained by multiplying
by log210. Try to convert using the floor of this number
and test to see whether it would read as the original number. If
so, then return; if not, then try with one more decimal digit.
(define (exact-ceiling x) (inexact->exact (ceiling x))) ;; Given positive integer mantissa and exponent (of 2) E2, find the ;; closest integer and exponent (of 10). (define (dbl->string mant e2) (define llog2 (/ (log 2) (log 10))) (define f (ldexp mant e2)) (define quo 0) (define point 0) (if (> e2 0) (let ((num (* mant (expt 2 e2)))) (set! point (max 0 (exact-ceiling (- (/ (log num) (log 10)) (* dbl-mant-dig llog2))))) (let ((den (expt 10 point))) (set! quo (round-quotient num den)) (cond ((not (= (mantexp->dbl quo point) f)) (set! point (+ -1 point)) (set! quo (round-quotient num (quotient den 10))))))) (let ((den (expt 2 (- e2)))) (set! point (exact-ceiling (* e2 llog2))) (let ((num (* mant (expt 10 (- point))))) (set! quo (round-quotient num den)) (cond ((and (positive? f) (not (= (mantexp->dbl quo point) f))) (set! point (+ -1 point)) (set! quo (round-quotient (* num 10) den))))))) (let* ((dman (number->string quo)) (lman (string-length dman))) (do ((idx (+ -1 lman) (+ -1 idx))) ((or (zero? idx) (not (eqv? #\0 (string-ref dman idx)))) (string-append "." (substring dman 0 (+ 1 idx)) "e" (number->string (+ point lman)))))))
When e2 is positive, num is bound to the
product of mant and
2e2. point is set to an
upper-bound of the number of decimal digits of num in
excess of the floating-point mantissa's precision.
round-quotient of num and
10point produces the integer quo.
mantexp->dbl of quo
and point is not equal to the original floating-point
value f, then the
round-quotient is computed
again with the divisor divided by 10 yielding one more digit of
When e2 is negative, den is bound to
2−e2 and point is set to the
negation of an upper-bound of the number of decimal digits in
num is bound to the product of mant and
of num and den produces the
integer quo. If
of quo and point is not equal to the original
floating-point value f, then
round-quotient is computed again
with num multiplied by 10 yielding one more digit of
The last part of
dbl->string produces a string using
number->string to convert the integer
mantissa and exponent components to decimal strings. Mantissa
trailing zeros are eliminated by scanning the dman string
in reverse for non-zero digits.
I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.
My actions and comments do not reflect in any way on MIT.|
|CS and Numerical Analysis|
|agj @ alum.mit.edu||Go Figure!|