Easy Accurate Reading and Writing of Floating-Point Numbers


Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.


In both How to Read Floating-Point Numbers Accurately[1990] and Printing floating-point numbers quickly and accurately[1996] the key issue is that successive rounding operations do not have the same effect as a single rounding operation.

The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE-754 double-precision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.


Both reading and writing of floating-point numbers can involve division of numbers larger than can be stored in the floating-point registers, causing rounding at undesired points of the conversion. Bignums (arbitrary precision integers) can perform division of large integers without rounding. What is needed is a bignum division-with-rounding operator, called round-quotient here. It can be implementated in Scheme as follows:

(define (round-quotient num den)
  (define quo (quotient num den))
  (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den))
      (+ quo (if (eqv? (negative? num) (negative? den)) 1 -1))

If the remainder is more than half of the denominator, then it rounds up; if it is less, then it rounds down; if it is equal, then it rounds to even. These are the same rounding rules as IEEE Standard for Binary Floating-Point Arithmetic (754-1985) and the Scheme procedure round.

The algorithms presented here use the integer-length function from Common-Lisp and the SLIB Scheme Library:

— Function: integer-length n

Returns the number of bits neccessary to represent n.


        (integer-length #b10101010)
        ⇒ 8
        (integer-length 0)
        ⇒ 0
        (integer-length #b1111)
        ⇒ 4

For positive argument n, integer-length returns ⌈log2 n⌉.

The algorithms also use ldexp (from C). Here is Scheme code for ldexp:

;; ldexp() from C.
(define (ldexp bmant bexp)
  (define DBL_MIN_10_EXP -1074)
  (if (> DBL_MIN_10_EXP (* -2 (abs bexp)))
      (* (* 1. bmant (expt 2 (quotient bexp 2)))
	 (expt 2 (- bexp (quotient bexp 2))))
      (* 1. bmant (expt 2 bexp))))

The conditional in ldexp is in order to work with bexp values where 2bexp is out of floating-point range.


The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floating-point number is straightforward.

;; dbl-mant-dig is the number of bits in the mantissa field of the
;; floating-point format.
(define dbl-mant-dig 53)

;; Given positive integer mantissa MANT and exponent (of 10) POINT,
;; find the closest binary floating-point number.
;; bex is the binary shift for quo=mant/scl; bex is negative.
(define (mantexp->dbl mant point)
  (if (>= point 0)
      (exact->inexact (* mant (expt 10 point)))
      (let* ((scl (expt 10 (- point)))
	     (bex (- (integer-length mant) (integer-length scl) dbl-mant-dig))
	     (num (* mant (expt 2 (- bex))))
	     (quo (round-quotient num scl)))
	(cond ((> (integer-length quo) dbl-mant-dig) ;too many bits of quotient
	       (set! bex (+ 1 bex))
	       (set! quo (round-quotient num (* scl 2)))))
	(ldexp (exact->inexact quo) bex))))

When point is non-negative, the mantissa is multiplied by 10point and the product is rounded to fit in dbl-mant-dig bits by exact->inexact.

With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl=10point. Over the floating-point range, the longest a rounded quotient of a n bit number and a m bit power-of-10 can be is 1+n−m bits; the shortest is n−m bits.

2bex is the power-of-two multiplier. The initial value of bex corresponds to the n−m case above. If the round-quotient is more than dbl-mant-dig bits long, then call round-quotient with double the denominator, 2⋅scl. In either case, the final step is to convert to floating-point using ldexp.


The algorithm for writing a floating-point number is more complicated because it must generate the shortest decimal mantissa which reads as the original floating-point input.

The approximate number of decimal digits needed to represent a binary number is obtained by multiplying dbl-mant-dig by log210. Try to convert using the floor of this number and test to see whether it would read as the original number. If so, then return; if not, then try with one more decimal digit.

(define (exact-ceiling x) (inexact->exact (ceiling x)))

;; Given positive integer mantissa and exponent (of 2) E2, find the
;; closest integer and exponent (of 10).
(define (dbl->string mant e2)
  (define llog2 (/ (log 2) (log 10)))
  (define f (ldexp mant e2))
  (define quo 0)
  (define point 0)
  (if (> e2 0)
      (let ((num (* mant (expt 2 e2))))
	(set! point
	      (max 0 (exact-ceiling (- (/ (log num) (log 10)) (* dbl-mant-dig llog2)))))
	(let ((den (expt 10 point)))
	  (set! quo (round-quotient num den))
	  (cond ((not (= (mantexp->dbl quo point) f))
		 (set! point (+ -1 point))
		 (set! quo (round-quotient num (quotient den 10)))))))
      (let ((den (expt 2 (- e2))))
	(set! point (exact-ceiling (* e2 llog2)))
	(let ((num (* mant (expt 10 (- point)))))
	  (set! quo (round-quotient num den))
	  (cond ((and (positive? f) (not (= (mantexp->dbl quo point) f)))
		 (set! point (+ -1 point))
		 (set! quo (round-quotient (* num 10) den)))))))
  (let* ((dman (number->string quo)) (lman (string-length dman)))
    (do ((idx (+ -1 lman) (+ -1 idx)))
	((or (zero? idx) (not (eqv? #\0 (string-ref dman idx))))
	 (string-append "." (substring dman 0 (+ 1 idx))
			"e" (number->string (+ point lman)))))))

When e2 is positive, num is bound to the product of mant and 2e2. point is set to an upper-bound of the number of decimal digits of num in excess of the floating-point mantissa's precision. The round-quotient of num and 10point produces the integer quo. If mantexp->dbl of quo and point is not equal to the original floating-point value f, then the round-quotient is computed again with the divisor divided by 10 yielding one more digit of precision.

When e2 is negative, den is bound to 2e2 and point is set to the negation of an upper-bound of the number of decimal digits in 2e2. num is bound to the product of mant and 10point. The round-quotient of num and den produces the integer quo. If mantexp->dbl of quo and point is not equal to the original floating-point value f, then the round-quotient is computed again with num multiplied by 10 yielding one more digit of precision.

The last part of dbl->string produces a string using Scheme's number->string to convert the integer mantissa and exponent components to decimal strings. Mantissa trailing zeros are eliminated by scanning the dman string in reverse for non-zero digits.


Copyright © 2013 Aubrey Jaffer

I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.  My actions and comments do not reflect in any way on MIT.
CS and Numerical Analysis
agj @
Go Figure!