Floating‑point arithmetic

Floating‑point arithmetic is the dominant numerical model for representing and manipulating real‑valued quantities on general‑purpose computers. It encodes numbers as sign, exponent, and significand fields in fixed‑width formats and defines arithmetic, rounding, and exception semantics to enable portable and predictable results across platforms, chiefly via the IEEE 754 standard and its ISO/IEC adoption. IEEE SA; ISO.

Historical development and standardization

Efforts to unify floating‑point behavior culminated in the 1985 publication of IEEE 754 for binary floating‑point; subsequent revisions in 2008 and 2019 broadened and clarified the standard, adding decimal formats and codifying fused multiply‑add among other features. IEEE SA; IEEE SA. Influential expository work by William Kahan and by David Goldberg in ACM Computing Surveys shaped practitioner understanding and popularized key error‑analysis concepts. Oracle Developer Studio reprint of Goldberg 1991; ACM CSUR index.

Representation and formats

IEEE 754 specifies binary and decimal formats defined by a radix (2 or 10), a precision in digits, and a bounded exponent range. Widely used basic binary formats are binary16 (half), binary32 (single), binary64 (double), and binary128 (quadruple); decimal32, decimal64, and decimal128 are the corresponding decimal formats. IEEE 754 (overview); ARM Learning Paths; ISO. Decimal formats use either densely packed decimal (DPD) or binary‑integer significand encodings while preserving identical numerical sets; DPD’s design was detailed by Cowlishaw to support efficient hardware and software implementation. IEEE 754 (decimal section); IBM Research: Densely Packed Decimal; General Decimal Arithmetic Specification.

IEEE formats include special values: signed zeros (+0 and −0), infinities (±∞), quiet and signaling NaNs, and subnormal numbers that preserve gradual underflow. These are integral to well‑defined arithmetic and comparisons. IEEE 754 (formats); Oracle Numerical Computation Guide.

Operations, rounding, and special values

The standard requires correctly rounded results for core arithmetic operations (add, subtract, multiply, divide, square root, remainder) and for fused multiply‑add, with language‑agnostic conversions, comparisons, and classification functions. IEEE SA; NVIDIA CUDA FP Guide. Five rounding modes are defined: round to nearest, ties to even; round to nearest, ties away; toward +∞; toward −∞; and toward 0. Round‑to‑nearest‑even reduces bias and limits error growth. IEEE 754 (rounding rules); Goldberg 1991 reprint.

Arithmetic raises status flags for invalid operation, division by zero, overflow, underflow, and inexact; systems may trap or just record these conditions and return specified default results. GNU C Library manual; Oracle Numerical Computation Guide. Subnormals provide “gradual underflow,” bounding relative error when magnitudes fall below the normal range; under round‑to‑nearest, the rounding error is at most one‑half unit in the last place. Oracle Numerical Computation Guide.

Accuracy measures and spacing

Spacing between adjacent representable numbers depends on the exponent and precision. Two standard measures are unit in the last place (ULP), the gap between consecutive representable numbers at a given magnitude, and machine epsilon, the ULP at 1.0 for a given format. Under correct rounding to nearest, basic operations err by at most 0.5 ULP. Intel documentation (ULP); Machine epsilon; Floating‑point arithmetic overview.

Numerical issues and algorithmic techniques

Loss of significance (cancellation) arises when subtracting nearly equal quantities, potentially amplifying prior rounding or measurement errors even when the subtraction itself is exact. Goldberg’s survey catalogues such pitfalls and mitigation strategies, including reformulations and compensated algorithms. Goldberg 1991 reprint; Catastrophic cancellation. The Sterbenz lemma states that subtraction of two floating‑point numbers within a factor of two of each other is exact (no rounding), a useful fact in error analysis. Sterbenz lemma; [Handbook of Floating‑Point Arithmetic](book://Jean-Michel Muller et al.|Handbook of Floating‑Point Arithmetic (2nd ed.)|Birkhäuser|2018).

Compensated summation, notably the Kahan summation algorithm, reduces accumulation error in long sums by tracking and reinjecting low‑order parts; pairwise summation achieves logarithmic error growth with favorable performance trade‑offs. Kahan summation algorithm; [Higham, Accuracy and Stability of Numerical Algorithms](book://Nicholas J. Higham|Accuracy and Stability of Numerical Algorithms (2nd ed.)|SIAM|2002). Fused multiply‑add computes a·b+c with a single rounding, improving both performance and accuracy in dot products, polynomial evaluation, and linear algebra routines; it became a required operation in the 2008 revision. NVIDIA CUDA FP Guide; Oracle addendum discussion.

Decimal floating‑point

Decimal formats (decimal32/64/128) represent coefficients in base 10 with defined exponent ranges, allowing exact representation and correctly rounded operations for decimal fractions common in financial and commercial software. IEEE 754 defines decimal interchange encodings using densely packed decimal or a binary‑integer coefficient encoding; both encode the same numerical set. IEEE 754 (decimal); General Decimal Arithmetic Specification; IBM Research: DPD. Decimal arithmetic’s motivation and design were articulated by Cowlishaw and incorporated during the 2008 revision, aligning hardware, languages, and databases with human‑centric rounding expectations. IBM Research (ARITH 2001 spec); ISO.

Language and system interfaces

Programming environments expose IEEE 754 features via numeric limits, rounding‑mode controls, nextafter‑style neighbor functions, and exception flags/traps, enabling reproducible algorithms and diagnostics in Numerical analysis and interval methods. The GNU C Library documents mapping between IEEE flags and language/runtime controls; similar facilities exist across platforms. GNU C Library manual; IEEE 754 (character/operations overview). Directed rounding modes support Interval arithmetic and rigorous error bounds in scientific codes. IEEE 754 (rounding).

Key references

David Goldberg’s survey “What Every Computer Scientist Should Know About Floating‑Point Arithmetic” remains a widely cited tutorial; the IEEE/ISO standard documents are the normative references; Higham and Muller et al. provide comprehensive algorithmic analysis. Oracle reprint of Goldberg 1991; IEEE SA; [Higham (book)](book://Nicholas J. Higham|Accuracy and Stability of Numerical Algorithms (2nd ed.)|SIAM|2002); [Handbook of Floating‑Point Arithmetic](book://Jean-Michel Muller et al.|Handbook of Floating‑Point Arithmetic (2nd ed.)|Birkhäuser|2018).

IEEE 754: internal link William Kahan: internal link ACM Computing Surveys: internal link IBM: internal link ISO/IEC 60559: internal link Numerical analysis: internal link Interval arithmetic: internal link