This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.
# Numerical Stability and Errors
The equation `(0.1 + 0.2) == 0.3` seems very straightforward. However, because of the way computers store floating-point numbers, this comparison evaluates to `false` in many programming languages.

But why, then, does `(0.5 + 0.25) == 0.75` hold true?
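A quick check in Python (any language using IEEE 754 floats behaves the same) shows both results side by side:

```python
print(0.1 + 0.2 == 0.3)    # False: 0.1 and 0.2 have no exact binary form
print(0.5 + 0.25 == 0.75)  # True: all three are exact sums of powers of two
print(0.1 + 0.2)           # 0.30000000000000004
```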
When performing floating-point arithmetic, some degree of error is inevitable. Next, we’ll discuss the origins of these errors and what can be done to mitigate them in calculations.
## How Floating-Point Numbers Are Stored in Computers
Floating-point numbers are represented in computers as bit patterns, as specified by the IEEE 754 standard. Single-precision floating-point numbers use 32 bits, while double precision uses 64. The representation is divided into a sign bit indicating positive or negative, a fractional part, and an exponent part.
- Sign Bit (1 bit): 0 for positive; 1 for negative
- Fractional Part: the digits after the leading 1 of the normalized significand
- Exponent Part: the power of two, stored with a bias (see below)
| Type | Size | Exponent Part | Fractional Part |
|---|---|---|---|
| Single | 32 bit | 8 bit | 23 bit |
| Double | 64 bit | 11 bit | 52 bit |
Taking single precision as an example, the 8-bit exponent field can store values from 0 to 255. To represent negative exponents as well, the IEEE 754 standard adds a fixed offset (the bias) to the actual exponent before storing it. For single precision, this bias is 127.
For example, the floating-point representation of -14.75 is built as follows:

- Sign Bit: negative, so it is `1`
- Exponent Part: first convert 14.75 to binary: `1110.11`. After normalization it becomes `1.11011 × 2^3`, so the stored exponent is 3 plus 127, which equals 130, or `10000010` in binary
- Fractional Part: `11011`, with the remaining bits filled with `0`

Thus, the floating-point representation of -14.75 is: `1 10000010 11011000000000000000000`
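We can verify this bit pattern directly in Python by packing the value as a single-precision float and reading the raw bits back with the standard `struct` module:

```python
import struct

# Pack -14.75 as a big-endian IEEE 754 single, then reinterpret the 4 bytes
# as an unsigned 32-bit integer to get at the raw bits.
bits = struct.unpack(">I", struct.pack(">f", -14.75))[0]
s = f"{bits:032b}"

# Split into sign (1 bit), exponent (8 bits), fraction (23 bits).
print(s[0], s[1:9], s[9:])  # 1 10000010 11011000000000000000000
```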
From this, we can understand where errors originate. With a negative exponent, the represented value falls in the range `0 < x < 1`, and such a value is stored as a sum of negative powers of two. A decimal fraction that cannot be written as a finite sum of terms `2^-n` can therefore only be approximated, never represented exactly. The previously mentioned `0.5 + 0.25` works without error precisely because those numbers are `2^-1 + 2^-2`.
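Python's `decimal` module can print the exact value the double closest to `0.1` actually stores, making the approximation visible:

```python
from decimal import Decimal

# Constructing a Decimal from a float exposes the float's exact stored value.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.5), Decimal(0.25))  # 0.5 0.25 -- both exactly representable
```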
## Why Store This Way?
Using scientific notation helps us handle numbers of very different scales while maintaining a certain level of precision. For instance, `0.00000012345` and `1234567890000` can be expressed as `1.2345E-7` and `1.23456789E12`, respectively. Computers store numbers in floating-point format in much the same way, except the notation is binary rather than decimal.
Real numbers are infinite, but computer storage is limited, so errors are unavoidable regardless of how we store numbers. What we can do is find a trade-off between precision and the range of representable numbers.
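That trade-off is visible in Python's `sys.float_info`, which reports the limits of the platform's double-precision type:

```python
import sys

# Exponent bits buy range; fraction bits buy precision.
print(sys.float_info.max)      # 1.7976931348623157e+308, largest finite double
print(sys.float_info.epsilon)  # 2.220446049250313e-16, gap from 1.0 to the next double
print(sys.float_info.dig)      # 15 decimal digits are always preserved
```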
## Significant Figures
Significant figures help us gauge precision. Numbers like `0.001` or `0.0135` can be written as `1 × 10^-3` and `1.35 × 10^-2`. Here, `0.001` has 1 significant figure, while `0.0135` has 3. The more significant figures, the higher the precision.
A simple rule for determining significant figures from Wikipedia states:
- All non-zero digits are significant
- Zeros between non-zero digits are significant
- Leading zeros are always insignificant
- For numbers requiring a decimal point, trailing zeros (zeros after the last non-zero digit) are significant
- For numbers not requiring a decimal point, trailing zeros may or may not be significant; additional notation (such as a bar over the last significant zero) or a stated uncertainty is needed to tell
## Cancellation of Significant Digits
When subtracting two floating-point numbers with very similar absolute values, the majority of digits cancel out, leaving many zeros and reducing the number of significant figures. This phenomenon is known as cancellation of significant digits (catastrophic cancellation).
For example, `1.234567890 - 1.234567889` should give `0.000000001`, but with insufficient precision, the result could even be rounded to `0`.
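Even in double precision the effect is visible: each input carries rounding error around its 16th digit, and after the cancellation that error lands close to the result's leading digit:

```python
a = 1.234567890
b = 1.234567889
diff = a - b

# The exact decimal answer is 1e-9, but the stored doubles differ slightly
# from their decimal values, so only a few significant digits survive.
print(diff)          # close to 1e-9, but not equal to it
print(diff == 1e-9)  # False
```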
This is something to be particularly mindful of in numerical computations. For example, consider the double-angle identity `cos θ = 1 - 2 sin²(θ/2)`, rearranged as:

`1 - cos θ = 2 sin²(θ/2)`

When the angle is small, the cosine value is very close to 1, making subtraction from 1 prone to significant-figure loss. For instance, with an angle of 1 degree and tables accurate to 6 significant figures, the left-hand side gives:

`1 - cos 1° = 1 - 0.999848 = 0.000152`

leaving only 3 significant figures. If we instead use the right-hand formula and look up `sin 0.5° = 0.00872654`:

`2 sin²(0.5°) = 2 × (0.00872654)² ≈ 0.000152305`
The discrepancy between the two results is significant, so care is needed in numerical calculations. To avoid loss of significant figures, consider the following methods:

- Avoid arithmetic operations between two numbers that are very close in absolute value
- Use an alternative formula (like the double-angle rearrangement above); essentially this is still avoiding arithmetic between two nearly equal numbers
- Increase the working precision
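As a sketch of the second method in Python: for a tiny angle (the value `1e-8` radians here is an arbitrary choice), the naive `1 - cos θ` loses every digit, while the rearranged form keeps full precision:

```python
import math

theta = 1e-8  # a very small angle, in radians

naive = 1.0 - math.cos(theta)        # cos(theta) rounds to exactly 1.0 here
stable = 2 * math.sin(theta / 2)**2  # rearranged form: no cancellation

print(naive)   # 0.0 -- all significant figures lost
print(stable)  # ~5e-17, matching the true value theta**2 / 2
```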
## Conclusion

More experienced engineers likely already know that floating-point arithmetic can introduce errors and understand why `0.1 + 0.2 != 0.3`. This article dug into how decimals are stored, the IEEE 754 standard, and the loss of significant figures.
The double-angle formula is something most of us met in high-school trigonometry; back then, for me, it was just a formula to plug numbers into. In real-world applications, however, the calculations are done by computers, and practical angles are rarely nice round values like 30 or 60 degrees. Teachers also seldom mention the significant-figure loss hiding in these formulas.
If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨