http://people.csail.mit.edu/jaffer/MIXF/MIXF-10

Representation of numerical values and SI units in character strings for information interchanges

Version	Released	Terms
`MIXF-10`	2011-10-09	RFC

Abstract
Motivation
Relation to Previous Work
Metric Interchange Format
- Alphabetic Case
SI Prefixes
Binary Prefixes
Unit Symbols
Unit Symbols (alphabetical)
Unit Examples
Use of Metric Units by Computer Programs
- Examples
- Programming Language Extension
Rationales
Acknowledgements
References
Metric Interchange Syntax
Voluntocracy License

Abstract

This document describes a character string encoding for numerical values and units which:

is unambiguous in all locales;
uses only "Portable Character Set" [PCS] characters matching "Basic Latin" characters in Plane 0 of the Universal Character Set [UCS];
is transparent to [UTF-7] and [UTF-8] UCS transformation formats;
is human readable and writable;
is machine readable and writable;
incorporates SI prefixes and units;
incorporates [ISO 6093] numbers; and
incorporates [IEC 60027-2] binary prefixes.

Motivation

According to Arthur Stephenson, chairman of the Mars Climate Orbiter Mission Failure Investigation Board [NASA 1999]:

"The 'root cause' of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software, ..."

Although the [ISO 6093] standard for automated interchange of numerical data is widely used, standardized measurement units (other than for page formatting) are not routinely attached to interchange data.

Relation to Previous Work

The 1986 standard Representations for U.S. Customary, SI, and Other Units to Be Used in Systems with Limited Character Sets [ANSI X3.50] states:

This standard was not designed for ... usage by humans as input to, or output from, data systems. ... They should never be printed out for publication or for other forms of public information transfer.

[ANSI X3.50] representations of units are ambiguous. "min" is both "minute" and "milliinch"; "cd" is both "candela" and "centiday".

Apart from SI units, [ANSI X3.50] supports only U.S. local units, is not complete in that support, and has no provision for extension to other locales. But non-SI unit systems are in such disarray that using them for interchange is not practical. Unit names signify different volumes in different locales; the Canadian gallon is 4.54609 liters, while the U.S. gallon is 3.785412 liters. The CRC Handbook of Chemistry and Physics [CRC] lists no fewer than six distinct (incompatible) systems of wire gauges.

The character set limitations targeted by [ANSI X3.50], namely single alphabetic case, are no longer common in data interchanges. But many of its double case "Form I" SI unit representations are similar to those presented here.

The audience for metric standards has changed and grown. In the preface to Guide for the Use of the International System of Units (SI) [NIST 811], B. Taylor writes:

The International System of Units, universally abbreviated SI, is the modern metric system of measurement. Long the dominant measurement system used in science, the SI is becoming the dominant measurement system used in international commerce.

[NIST 811] details a methodology for expressing measurement units in both text and symbolic form in scientific and other documents. Its unit expressions combine over 40 metric base and derived unit symbols unambiguously. Taylor's unit symbols are the basis for this metric interchange format.

Metric Interchange Format

In the expression for the value of a quantity, the unit symbol is placed after the numerical value. A dot (PERIOD, ".") is placed between the numerical value and the unit symbol.

Within a compound unit, each of the base and derived symbols can optionally have an attached SI prefix. The binary prefixes can be used with base units B (byte) and bit.

Unit symbols formed from other unit symbols by multiplication are indicated by means of a dot (PERIOD, ".") placed between them.

Unit symbols formed from other unit symbols by division are indicated by means of a SOLIDUS ("/") or negative exponents. The SOLIDUS must not be repeated in the same compound unit unless contained within a parenthesized subexpression.

The grouping formed by a prefix symbol attached to a unit symbol constitutes a new inseparable symbol (forming a multiple or submultiple of the unit concerned) which can be raised to a positive or negative power and which can be combined with other unit symbols to form compound unit symbols.

The grouping formed by surrounding compound unit symbols with parentheses ("(" and ")") constitutes a new inseparable symbol which can be raised to a positive or negative power and which can be combined with other unit symbols to form compound unit symbols.

Compound prefix symbols, that is, prefix symbols formed by the juxtaposition of two or more prefix symbols, are not permitted.

Prefix symbols are not used with the time-related unit symbols min (minute), h (hour), d (day). No prefix symbol may be used with dB (decibel) or u (unified atomic mass unit). Only submultiple prefix symbols may be used with the unit symbols L (liter), Np (neper), o (degree), oC (degree Celsius), rad (radian), and sr (steradian). Submultiple prefix symbols may not be used with the unit symbols t (metric ton), r (revolution), or Bd (baud).
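The prefix restrictions above can be expressed as set lookups. The following Python is a minimal sketch, not part of the specification: the function name `split_prefixed_unit` and the set names are hypothetical, while the symbol classes are copied from the Metric Interchange Syntax at the end of this document.

```python
# Symbol classes copied from the Metric Interchange Syntax.
MULTIPLE = {"da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y"}
SUBMULTIPLE = {"d", "c", "m", "u", "n", "p", "f", "a", "z", "y"}
BINARY = {"Ki", "Mi", "Gi", "Ti", "Pi", "Ei"}

MULTIPLE_ONLY = {"B", "Bd", "r", "t"}                    # unit_p_symbol
SUBMULTIPLE_ONLY = {"L", "Np", "o", "oC", "rad", "sr"}   # unit_n_symbol
ANY_DECIMAL = {"A", "Bq", "C", "F", "Gy", "H", "Hz", "J", "K", "N", "Ohm",
               "Pa", "S", "Sv", "T", "V", "W", "Wb", "bit", "cd", "eV", "g",
               "kat", "lm", "lx", "m", "mol", "s"}        # unit_b_symbol
NO_PREFIX = {"d", "dB", "h", "min", "u"}                 # unit___symbol

ALL_UNITS = MULTIPLE_ONLY | SUBMULTIPLE_ONLY | ANY_DECIMAL | NO_PREFIX
TAKES_MULTIPLE = MULTIPLE_ONLY | ANY_DECIMAL
TAKES_SUBMULTIPLE = SUBMULTIPLE_ONLY | ANY_DECIMAL


def split_prefixed_unit(text):
    """Return every (prefix, unit) reading of text that the rules allow."""
    readings = []
    if text in ALL_UNITS:
        readings.append(("", text))
    for cut in (1, 2):  # prefix symbols are one or two characters long
        prefix, unit = text[:cut], text[cut:]
        if prefix in MULTIPLE and unit in TAKES_MULTIPLE:
            readings.append((prefix, unit))
        elif prefix in SUBMULTIPLE and unit in TAKES_SUBMULTIPLE:
            readings.append((prefix, unit))
        elif prefix in BINARY and unit in {"B", "bit"}:
            readings.append((prefix, unit))
    return readings
```

With these sets, "cd" and "min" each have exactly one reading (candela and minute), because d takes no prefix and "in" is not a unit symbol; "mt" has none, because t accepts only multiple prefixes.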

A unit exponent follows the unit, separated by a CIRCUMFLEX ("^"). Exponents may be positive or negative. Fractional exponents must be parenthesized.
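As an illustration of the multiplication, division, and exponent rules, the sketch below reduces a compound unit to a mapping from symbol to net exponent. It is a simplification under stated assumptions: parenthesized subexpressions are not handled (only parenthesized fractional exponents are), symbols and prefixes are not validated, and the name `unit_exponents` is hypothetical.

```python
from fractions import Fraction
import re

# One factor: optional separator, a symbol, and an optional exponent of the
# form ^n, ^-n, ^(n/m) or ^(-n/m).
TOKEN = re.compile(r"([./]?)([A-Za-z]+)(?:\^(-?\d+|\(-?\d+/\d+\)))?")


def unit_exponents(unit):
    """Map each symbol of a parenthesis-free compound unit to its exponent."""
    exponents = {}
    pos = 0
    after_solidus = False
    while pos < len(unit):
        match = TOKEN.match(unit, pos)
        if match is None:
            raise ValueError(f"cannot read unit at position {pos}: {unit!r}")
        separator, symbol, exponent = match.groups()
        # The first factor has no separator; every later factor needs one.
        # Nothing may follow the single factor after an unparenthesized SOLIDUS.
        if after_solidus or (pos == 0) != (separator == ""):
            raise ValueError(f"misplaced separator in {unit!r}")
        power = Fraction(exponent.strip("()")) if exponent else Fraction(1)
        if separator == "/":
            power = -power
            after_solidus = True
        exponents[symbol] = exponents.get(symbol, 0) + power
        pos = match.end()
    return exponents
```

For example, `unit_exponents("kg/m^3")` yields exponent 1 for kg and -3 for m, and a repeated SOLIDUS such as "m/s/s" raises `ValueError`.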

Alphabetic Case

The case of letters in unit symbols must match the symbols specified here. Unit symbols are composed of lower-case letters except that:

the first letter of the symbol is an upper-case letter when the name of the unit is derived from the name of a person; and
the symbol for the liter is L.

The prefix symbols Y (yotta), Z (zetta), E (exa), P (peta), T (tera), G (giga), and M (mega) are printed in upper-case letters while all other prefix symbols are printed in lower-case letters.

SI Prefixes

Factor	Prefix	Symbol
1e1	deka	da
1e2	hecto	h
1e3	kilo	k
1e6	mega	M
1e9	giga	G
1e12	tera	T
1e15	peta	P
1e18	exa	E
1e21	zetta	Z
1e24	yotta	Y

Factor	Prefix	Symbol
1e-1	deci	d
1e-2	centi	c
1e-3	milli	m
1e-6	micro	u
1e-9	nano	n
1e-12	pico	p
1e-15	femto	f
1e-18	atto	a
1e-21	zepto	z
1e-24	yocto	y
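The two tables above translate directly into a lookup of decimal exponents. This is an illustrative sketch only; the names `PREFIX_EXPONENT` and `prefix_factor` are hypothetical, and exact rational arithmetic is used so that submultiples are not rounded.

```python
from fractions import Fraction

# Decimal exponent for each SI prefix symbol, taken from the tables above.
PREFIX_EXPONENT = {
    "da": 1, "h": 2, "k": 3, "M": 6, "G": 9,
    "T": 12, "P": 15, "E": 18, "Z": 21, "Y": 24,
    "d": -1, "c": -2, "m": -3, "u": -6, "n": -9,
    "p": -12, "f": -15, "a": -18, "z": -21, "y": -24,
}


def prefix_factor(symbol):
    """Return the exact factor for an SI prefix symbol (case-sensitive)."""
    return Fraction(10) ** PREFIX_EXPONENT[symbol]
```

Because case is significant, `prefix_factor("M")` (mega) and `prefix_factor("m")` (milli) differ by a factor of 1e9.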

Binary Prefixes

These binary prefixes are valid only with the units B (byte) and bit. However, decimal prefixes can also be used with bit; and decimal multiple (not submultiple) prefixes can also be used with B (byte).

Factor	Power-of-2	Name	Symbol
1.024e3	2¹⁰	kibi	Ki
1.048576e6	2²⁰	mebi	Mi
1.073741824e9	2³⁰	gibi	Gi
1.099511627776e12	2⁴⁰	tebi	Ti
1.125899906842624e15	2⁵⁰	pebi	Pi
1.152921504606846976e18	2⁶⁰	exbi	Ei
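Each binary prefix in the table is a power of 2¹⁰, so the factors can be computed rather than stored. A short sketch (the names `BINARY_PREFIXES` and `binary_prefix_factor` are hypothetical):

```python
# Binary prefix symbols in increasing order; the n-th one (counting from 1)
# stands for 2 ** (10 * n).
BINARY_PREFIXES = ("Ki", "Mi", "Gi", "Ti", "Pi", "Ei")


def binary_prefix_factor(symbol):
    """Return the exact integer factor for a binary prefix symbol."""
    return 2 ** (10 * (BINARY_PREFIXES.index(symbol) + 1))
```

This makes the difference between 1.KiB (1024 bytes) and 1.kB (1000 bytes) explicit.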

Unit Symbols

Type of Quantity	Name	Symbol	Equivalent
time	second	s
time	minute	min	= 60`.s`
time	hour	h	= 60`.min`
time	day	d	= 24`.h`
frequency	hertz	Hz	`s^-1`
signaling rate	baud	Bd	`s^-1`
length	meter	m
volume	liter	L	`dm^3`
plane angle	radian	rad
solid angle	steradian	sr	`rad^2`
plane angle	revolution	r	=*6.283185307179586`.rad`
plane angle	degree	o	=*2.777777777777778e-3`.r`
information capacity	bit	bit
information capacity	byte, octet	B	= 8`.bit`
mass	gram	g
mass	ton	t	`Mg`
mass	unified atomic mass unit	u	= 1.660538782e-27`.kg`
amount of substance	mole	mol
catalytic activity	katal	kat	`mol/s`
thermodynamic temperature	kelvin	K
temperature	degree Celsius	oC
luminous intensity	candela	cd
luminous flux	lumen	lm	`cd.sr`
illuminance	lux	lx	`lm/m^2`
force	newton	N	`m.kg.s^-2`
pressure, stress	pascal	Pa	`N/m^2`
energy, work, heat	joule	J	`N.m`
energy	electronvolt	eV	= 1.602176487e-19`.J`
power, radiant flux	watt	W	`J/s`
logarithm of power ratio	neper	Np
logarithm of power ratio	decibel	dB	=*0.1151293`.Np`
electric current	ampere	A
electric charge	coulomb	C	`s.A`
electric potential, EMF	volt	V	`W/A`
capacitance	farad	F	`C/V`
electric resistance	ohm	Ohm	`V/A`
electric conductance	siemens	S	`A/V`
magnetic flux	weber	Wb	`V.s`
magnetic flux density	tesla	T	`Wb/m^2`
inductance	henry	H	`Wb/A`
radionuclide activity	becquerel	Bq	`s^-1`
absorbed dose energy	gray	Gy	`m^2.s^-2`
dose equivalent	sievert	Sv	`m^2.s^-2`

*	The exact formulas are:
	`r/rad`	= 8 * atan(1)
	`o/r`	= 1 / 360
	`dB/Np`	= ln(10) / 20
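The exact formulas can be evaluated to check the rounded constants in the tables. This sketch uses only the Python standard library; the constant names are hypothetical.

```python
import math

RAD_PER_REVOLUTION = 8 * math.atan(1)      # r/rad, that is 2*pi
REVOLUTIONS_PER_DEGREE = 1 / 360           # o/r
NEPERS_PER_DECIBEL = math.log(10) / 20     # dB/Np

# One degree expressed in radians follows by multiplying the two angle factors.
RAD_PER_DEGREE = REVOLUTIONS_PER_DEGREE * RAD_PER_REVOLUTION
```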

Unit Symbols (alphabetical)

Type of Quantity	Name	Symbol	Equivalent
electric current	ampere	A
information capacity	byte, octet	B	= 8`.bit`
signaling rate	baud	Bd	`s^-1`
information capacity	bit	bit
radionuclide activity	becquerel	Bq	`s^-1`
electric charge	coulomb	C	`s.A`
luminous intensity	candela	cd
time	day	d	= 24`.h`
logarithm of power ratio	decibel	dB	=*0.1151293`.Np`
energy	electronvolt	eV	= 1.602176487e-19`.J`
capacitance	farad	F	`C/V`
mass	gram	g
absorbed dose energy	gray	Gy	`m^2.s^-2`
inductance	henry	H	`Wb/A`
time	hour	h	= 60`.min`
frequency	hertz	Hz	`s^-1`
energy, work, heat	joule	J	`N.m`
thermodynamic temperature	kelvin	K
catalytic activity	katal	kat	`mol/s`
volume	liter	L	`dm^3`
luminous flux	lumen	lm	`cd.sr`
illuminance	lux	lx	`lm/m^2`
length	meter	m
time	minute	min	= 60`.s`
amount of substance	mole	mol
force	newton	N	`m.kg.s^-2`
logarithm of power ratio	neper	Np
plane angle	degree	o	=*2.777777777777778e-3`.r`
temperature	degree Celsius	oC
electric resistance	ohm	Ohm	`V/A`
pressure, stress	pascal	Pa	`N/m^2`
plane angle	revolution	r	=*6.283185307179586`.rad`
plane angle	radian	rad
electric conductance	siemens	S	`A/V`
time	second	s
solid angle	steradian	sr	`rad^2`
dose equivalent	sievert	Sv	`m^2.s^-2`
magnetic flux density	tesla	T	`Wb/m^2`
mass	ton	t	`Mg`
mass	unified atomic mass unit	u	= 1.660538782e-27`.kg`
electric potential, EMF	volt	V	`W/A`
power, radiant flux	watt	W	`J/s`
magnetic flux	weber	Wb	`V.s`

Unit Examples

Most of these are from [NIST 811] - Examples of SI derived units ... and Essentials of the SI: Base & derived units

Type of Quantity	Name	Symbol
area	square meter	`m^2`
volume	cubic meter	`m^3`
speed, velocity	meter per second	`m/s`
acceleration	meter per second squared	`m/s^2`
wave number	reciprocal meter	`m^-1`
mass density (density)	kilogram per cubic meter	`kg/m^3`
specific volume	cubic meter per kilogram	`m^3/kg`
current density	ampere per square meter	`A/m^2`
magnetic field strength	ampere per meter	`A/m`
concentration	mole per cubic meter	`mol/m^3`
luminance	candela per square meter	`cd/m^2`
angular velocity	radian per second	`rad/s`
angular acceleration	radian per second squared	`rad/s^2`
dynamic viscosity	pascal second	`Pa.s`
moment of force	newton meter	`N.m`
surface tension	newton per meter	`N/m`
heat flux density	watt per square meter	`W/m^2`
radiant intensity	watt per steradian	`W/sr`
radiance	watt per square meter steradian	`W/(m^2.sr)`
heat capacity, entropy	joule per kelvin	`J/K`
specific heat or entropy	joule per kilogram kelvin	`J/(kg.K)`
specific energy	joule per kilogram	`J/kg`
thermal conductivity	watt per meter kelvin	`W/(m.K)`
energy density	joule per cubic meter	`J/m^3`
electric field strength	volt per meter	`V/m`
electric charge density	coulomb per cubic meter	`C/m^3`
electric flux density	coulomb per square meter	`C/m^2`
permittivity	farad per meter	`F/m`
permeability	henry per meter	`H/m`
molar energy	joule per mole	`J/mol`
molar entropy or heat	joule per mole kelvin	`J/(mol.K)`
exposure (x and gamma rays)	coulomb per kilogram	`C/kg`
absorbed dose rate	gray per second	`Gy/s`
rotational speed	revolution per minute	`r/min`
catalytic concentration	katal per cubic meter	`kat/m^3`
data rate	mebibit per second	`Mibit/s`
noise voltage density	nanovolt per root hertz	`nV/Hz^(1/2)`
hourly rate	US Dollars per hour	`USD/h`
price	Euros per kilogram	`EUR/kg`
exchange rate	Japanese Yen per US Dollar	`JPY/USD`

Use of Metric Units by Computer Programs

Metric units attached to individual numerical values have the format described above. An unattached unit can be used to specify the units applying to a row, column, or entire table of numerical values; or for other purposes.

Programming language support for metric interchange should be provided by a function of two unit arguments returning a conversion factor. Multiplying a numerical value expressed in the second unit by the returned conversion factor yields the numerical value expressed in the first unit. This function must return a non-positive number if either of its arguments is not a syntactically valid unit; or if the conversion factor does not exist.

Examples

    UCF("km/s", "m/s" ) --> 0.001     UCF("N"   , "m/s" ) --> 0
    UCF("moC" , "oC"  ) --> 1000      UCF("mK"  , "oC"  ) --> 0
    UCF("rad" , "o"   ) --> 0.0174533 UCF("K"   , "o"   ) --> 0
    UCF("K"   , "K"   ) --> 1         UCF("oK"  , "oK"  ) --> -3
    UCF(""    , "s/s" ) --> 1         UCF("km/h", "mph" ) --> -2
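The behavior shown in these examples can be sketched in Python. This is not a reference implementation: the unit table is a deliberately tiny, hypothetical subset, only the prefixes k and m are known, parentheses and fractional exponents are not handled, and the names `ucf`, `UNITS`, and `PREFIXES` are invented here. The negative return codes are an assumption inferred from the two examples above (1 for an invalid first argument plus 2 for an invalid second argument, negated); the text only requires a non-positive number.

```python
import math
import re

# Hypothetical subset of the unit table:
# symbol -> (factor, dimension exponents, prefixes allowed in this sketch).
# Dimension order: (m, kg, s, K, oC, rad). oC has its own dimension so that
# no conversion factor between oC and K exists.
UNITS = {
    "m":   (1.0,           (1, 0, 0, 0, 0, 0), ("k", "m")),
    "g":   (1e-3,          (0, 1, 0, 0, 0, 0), ("k", "m")),
    "s":   (1.0,           (0, 0, 1, 0, 0, 0), ("k", "m")),
    "h":   (3600.0,        (0, 0, 1, 0, 0, 0), ()),
    "N":   (1.0,           (1, 1, -2, 0, 0, 0), ("k", "m")),
    "K":   (1.0,           (0, 0, 0, 1, 0, 0), ("k", "m")),
    "oC":  (1.0,           (0, 0, 0, 0, 1, 0), ("m",)),
    "rad": (1.0,           (0, 0, 0, 0, 0, 1), ("m",)),
    "o":   (math.pi / 180, (0, 0, 0, 0, 0, 1), ("m",)),
}
PREFIXES = {"k": 1e3, "m": 1e-3}
TOKEN = re.compile(r"([./]?)([A-Za-z]+)(?:\^(-?\d+))?")


def _lookup(name):
    """Return (factor, dimensions) for a possibly prefixed symbol, or None."""
    if name in UNITS:
        return UNITS[name][0], UNITS[name][1]
    prefix, rest = name[:1], name[1:]
    if rest in UNITS and prefix in UNITS[rest][2]:
        return PREFIXES[prefix] * UNITS[rest][0], UNITS[rest][1]
    return None


def _parse(unit):
    """Return (factor, dimensions) for a unit string, or None if invalid."""
    factor, dims = 1.0, (0,) * 6
    pos, after_solidus = 0, False
    while pos < len(unit):
        match = TOKEN.match(unit, pos)
        if match is None or after_solidus:
            return None
        separator, name, exponent = match.groups()
        if (pos == 0) != (separator == ""):
            return None
        entry = _lookup(name)
        if entry is None:
            return None
        power = int(exponent) if exponent else 1
        if separator == "/":
            power, after_solidus = -power, True
        factor *= entry[0] ** power
        dims = tuple(d + power * e for d, e in zip(dims, entry[1]))
        pos = match.end()
    return factor, dims


def ucf(to_unit, from_unit):
    """Factor that converts a value in from_unit to a value in to_unit."""
    to, frm = _parse(to_unit), _parse(from_unit)
    if to is None or frm is None:
        # ASSUMPTION: error code inferred from the examples, not specified.
        return -((to is None) + 2 * (frm is None))
    if to[1] != frm[1]:
        return 0  # valid units, but no conversion factor exists
    return frm[0] / to[0]
```

Under these assumptions the sketch reproduces the ten example results above, up to floating-point rounding.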

Programming Language Extension

Lexical numerical constants in the programming languages C, Pascal, and Scheme could be extended to incorporate Metric Interchange Syntax compatibly with their current syntaxes; but this is not required for supporting input and output of units.

Rationales

Portability of Numbers

"Representation of numerical values in character strings for information interchanges", [ISO 6093], specifies the three machine-readable presentations in widespread use (Integer, Decimal, and Exponential notations) using only the characters:

<space>
<left-parenthesis>	(
<right-parenthesis>	)
<comma>	,
<plus-sign>	+
<hyphen-minus>	-
<period>	.
<E>	E
<e>	e
<digit>	0 - 9

In [UTF-7] the character PLUS-SIGN ("+") is not directly encoded, requiring multi-octet encoding. But every [ISO 6093] numeric value can be expressed without the use of PLUS-SIGN. So the number syntax given here does not include PLUS-SIGN.
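Since PLUS-SIGN appears in an [ISO 6093] number only as an optional sign, removing it leaves the value unchanged. A minimal sketch, assuming the input is already a well-formed number (the function name `drop_plus_signs` is hypothetical):

```python
def drop_plus_signs(number):
    """Rewrite a numeric string without PLUS-SIGN; the value is unchanged."""
    return number.replace("+", "")
```

For example, "+1.5E+03" becomes "1.5E03", and both strings denote 1500.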

Locale charsets all support the digits 0 to 9. There are only 3 LC_NUMERIC attributes: decimal_point, thousands_sep, and grouping. [ISO 6093] specifies use of either "." or "," for the decimal point. [ISO 6093] does not allow grouping. There is no LC_NUMERIC attribute for exponent. Thus Latin characters ("e" or "E") must be available in all languages which support [ISO 6093].

The programming languages C, Fortran, PL/I, Pascal, and Scheme accept [ISO 6093] numbers both as lexical constants and as input data.

Portable Character Set

Of the SI symbols, the "micro" prefix (GREEK-SMALL-LETTER-MU or MICRO-SIGN), "ohm" symbol (GREEK-CAPITAL-LETTER-OMEGA), and "degree" symbol (DEGREE-SIGN) are not supported by all charset encodings. By substituting "u", "Ohm", and "o" respectively, the unit symbols remain readable while preserving the system's unambiguity.

Taylor recommends using the MIDDLE-DOT character between multiplied unit symbols. To support those charset encodings lacking MIDDLE-DOT, metric interchange format instead uses PERIOD (".").

The unit superscript exponents could be formed using SUPERSCRIPT-MINUS, SUPERSCRIPT-ONE, SUPERSCRIPT-TWO, SUPERSCRIPT-THREE, etc. But these characters are not universal. So a CIRCUMFLEX ("^") is placed between a unit and its exponent, which is written with portable digits, preceded by HYPHEN-MINUS when negative.
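The substitutions described in this section can be sketched as a translation from typographic SI notation to the portable form. This is an illustration, not part of the format: the function name `to_portable` is hypothetical, and only the superscript characters named above (minus, one, two, three) are handled.

```python
import re

# Single-character substitutions described in this section.
PORTABLE = str.maketrans({
    "\u00b5": "u",     # MICRO SIGN
    "\u03bc": "u",     # GREEK SMALL LETTER MU
    "\u03a9": "Ohm",   # GREEK CAPITAL LETTER OMEGA
    "\u00b0": "o",     # DEGREE SIGN
    "\u00b7": ".",     # MIDDLE DOT
})
# SUPERSCRIPT MINUS, ONE, TWO, THREE mapped to their portable equivalents.
SUPERSCRIPTS = str.maketrans("\u207b\u00b9\u00b2\u00b3", "-123")
SUPERSCRIPT_RUN = re.compile("[\u207b\u00b9\u00b2\u00b3]+")


def to_portable(unit):
    """Rewrite a typographic unit string using only portable characters."""
    # Turn each run of superscripts into "^" followed by plain characters.
    unit = SUPERSCRIPT_RUN.sub(
        lambda m: "^" + m.group().translate(SUPERSCRIPTS), unit)
    return unit.translate(PORTABLE)
```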

The symbol for the liter, L, was adopted by the General Conference on Weights and Measures in order to avoid the risk of confusion between the letter l and the number 1 (see [NIST 811] - Units Outside the SI).

Metric Interchange Format (including numbers) uses only the characters:

<left-parenthesis>	(
<right-parenthesis>	)
<comma>	,
<hyphen-minus>	-
<period>	.
<solidus>	/
<circumflex>	^
<digit>	0 - 9
<upper>	A - Z
<lower>	a - z

Binary Units and Prefixes

Computer professionals sometimes use the term "kilobyte" to mean 1024 bytes. However, standards for data interchange must be unambiguous in all contexts. In December 1998 the International Electrotechnical Commission (IEC) approved as an IEC International Standard [IEC 60027-2] names and symbols for prefixes for binary multiples for use in the fields of data processing and data transmission.

As of 2000, the units bit and byte have not been accepted for use with SI, but are in widespread use. The IEC symbols are "B" for byte and "bit" for bit. To avoid conflict for "B", the bel was replaced by the decibel (dB).

Miscellany

Because white noise power in a bandwidth is proportional to that bandwidth, electronic noise units can have fractional exponents as in nV/Hz^(1/2) (nanovolt per root hertz).

Degree Celsius (oC) is not convertible to kelvin (K) by multiplication of a constant. Thus the formula "oC = K - 273.15" does not appear in the "Unit Symbols" table; and the conversion-factor function must return a non-positive number when called to convert between oC and K.
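A short numerical check shows why no single factor can exist. In this sketch (the function name `celsius_to_kelvin` is hypothetical) the ratio of the kelvin value to the Celsius value changes with the temperature, so it cannot be a constant conversion factor.

```python
def celsius_to_kelvin(celsius):
    """Apply the offset formula K = oC + 273.15; this is not a scaling."""
    return celsius + 273.15


# Ratio K/oC at two temperatures: about 28.3 at 10 oC and about 3.7 at 100 oC.
ratio_at_10 = celsius_to_kelvin(10.0) / 10.0
ratio_at_100 = celsius_to_kelvin(100.0) / 100.0
```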

Programming Language Syntax Extension

Because a PERIOD (".") after a numerical lexical constant is not specified in the syntax of the programming languages C, Pascal, and Scheme, the syntax of their lexical constants could be extended to incorporate SI unit symbols. The syntax of "double" in Java could similarly be extended.

Acknowledgements

Arnold G. Reinhold helped complete and clarify ideas and presentation. Jon Krom discovered disparities between the text and the syntax, and suggested clarifications.

References

[ANSI X3.50]: ANSI, Representations for U.S. customary, SI, and other units to be used in systems with limited character sets, ANSI X3.50, 1986.
[CODATA]: P. Mohr and B. Taylor, CODATA Recommended Values of the Fundamental Physical Constants, National Institute of Standards and Technology, 2006.
[CRC]: Chemical Rubber Company, CRC handbook of chemistry and physics, CRC Press, 67th edition, 1986.
[IEC 60027-2]: IEC, Amendment 2 to IEC International Standard IEC 60027-2: Letter symbols to be used in electrical technology - Part 2: Telecommunications and electronics, January 1999.
[ISO 2955]: ISO, Information processing - Representation of SI and other units in systems with limited character sets, ISO 2955:1983.
[ISO 6093]: ISO, Representation of numerical values in character strings for information interchanges, ISO 6093:1985.
[NASA 1999]: NASA, Mars Climate Orbiter Failure Board Releases Report, http://mars.jpl.nasa.gov/msp98/news/mco991110, November 1999.
[NIST 811]: Taylor, B., Guide for the Use of the International System of Units (SI), NIST Special Publication 811, 1995 Edition.
[PCS]: The Open Group, Portable Character Set, The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition.
[SI]: Bureau International des Poids et Mesures, The International System of Units (SI), 8th edition, 2006.
[UCS]: ISO, Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane (BMP), ISO/IEC 10646-1, March 2000.
[UNICODE]: The Unicode Consortium, The Unicode Standard, Version 3.0, Addison-Wesley Pub Co, February 2000.
[UTF-7]: D. Goldsmith, UTF-7, A Mail-Safe Transformation Format of Unicode, RFC 2152, May 1997.
[UTF-8]: F. Yergeau, UTF-8, a transformation format of ISO 10646, RFC 2279, January 1998.

Metric Interchange Syntax

Here is a YACC-like syntax for metric quantities (with [ISO 6093] numbers).

 quantity_value
        : real
        | real '.' unit
        ;

 unit
        : unit_product
        | unit_product '/' single_unit
        ;

 unit_product
        : single_unit
        | unit_product '.' single_unit
        ;

 single_unit
        : punit
        | punit '^' uxponent
        | '(' unit ')'
        | '(' unit ')^' uxponent
        ;

 uxponent
        : uinteger
        | '-' uinteger
        | '(' uinteger '/' uinteger ')'
        | '(-' uinteger '/' uinteger ')'
        ;

 punit
        : decimal_multiple_prefix unit_p_symbol
        | decimal_submultiple_prefix unit_n_symbol
        | decimal_multiple_prefix unit_b_symbol
        | decimal_submultiple_prefix unit_b_symbol
        | binary_prefix 'B'
        | binary_prefix 'bit'
        | unit_p_symbol
        | unit_n_symbol
        | unit_b_symbol
        | unit___symbol
        ;

 decimal_multiple_prefix
        : 'E' | 'G' | 'M' | 'P' | 'T' | 'Y' | 'Z' | 'da' | 'h' | 'k'
        ;

 decimal_submultiple_prefix
        : 'a' | 'c' | 'd' | 'f' | 'm' | 'n' | 'p' | 'u' | 'y' | 'z'
        ;

 binary_prefix
        : 'Ei' | 'Gi' | 'Ki' | 'Mi' | 'Pi' | 'Ti'
        ;

 unit_p_symbol
        : 'B' | 'Bd' | 'r' | 't'
        ;

 unit_n_symbol
        : 'L' | 'Np' | 'o' | 'oC' | 'rad' | 'sr'
        ;

 unit_b_symbol
        : 'A' | 'Bq' | 'C' | 'F' | 'Gy' | 'H' | 'Hz' | 'J' | 'K' | 'N'
        | 'Ohm' | 'Pa' | 'S' | 'Sv' | 'T' | 'V' | 'W' | 'Wb' | 'bit'
        | 'cd' | 'eV' | 'g' | 'kat' | 'lm' | 'lx' | 'm' | 'mol' | 's'
        ;

 unit___symbol
        : 'd' | 'dB' | 'h' | 'min' | 'u'
        ;

 real
        : ureal
        | '-' ureal
        ;

 ureal
        : numerical_value
        | numerical_value suffix
        ;

 numerical_value
        : uinteger
        | dot uinteger
        | uinteger dot uinteger
        | uinteger dot
        ;

 dot
        : '.' | ','
        ;

 uinteger
        : digit uinteger
        | digit
        ;

 suffix
        : exponent_marker uinteger
        | exponent_marker '-' uinteger
        ;

 exponent_marker
        : 'e' | 'E'
        ;

 digit
        : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
        ;
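As a rough illustration of the `quantity_value` and `real` productions, the following Python sketch splits a quantity string into a number and a unit string using a regular expression. It is an approximation under stated assumptions: the unit part is only checked against the permitted characters, not against the `unit` production, and the names `REAL`, `QUANTITY`, and `parse_quantity` are hypothetical.

```python
import re

# real: optional '-', then digits with an optional '.' or ',' decimal mark,
# then an optional exponent suffix without PLUS-SIGN.
REAL = r"-?(?:\d+(?:[.,]\d*)?|[.,]\d+)(?:[eE]-?\d+)?"
# quantity_value: real, optionally followed by '.' and a unit string.
QUANTITY = re.compile(rf"({REAL})(?:\.([A-Za-z(][A-Za-z0-9()./^-]*))?")


def parse_quantity(text):
    """Split a quantity string into (float value, unit string)."""
    match = QUANTITY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a quantity_value: {text!r}")
    number, unit = match.groups()
    # float() accepts only '.', so a decimal comma is normalized first.
    return float(number.replace(",", ".")), unit or ""
```

Matching the whole string lets the expression backtrack, so that in "3.km" the PERIOD is read as the separator before the unit rather than as a trailing decimal mark.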

Voluntocracy License

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and these terms are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to Voluntocracy, except as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by Voluntocracy or its successors or assigns.

This document and the information contained herein is provided on an "as is" basis and Voluntocracy disclaims all warranties, express or implied, including but not limited to any warranty that the use of the information herein will not infringe any rights or any implied warranties of merchantability or fitness for a particular purpose.

I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory. My actions and comments do not reflect in any way on MIT.
	Aubrey Jaffer	agj @ alum.mit.edu