What is a suitable tolerance for expecting n floats to sum to 1
I have n floats which should sum to 1, each of which can take any value between 0 and 1. I want to check that they sum to 1, but given the inaccuracies of floating-point numbers, checking that sum == 1 will not always work.
Therefore I want to check that Math.abs(sum - 1) < epsilon, but I am not sure what a good value of epsilon should be. Is it possible to determine the maximum inaccuracy?
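For concreteness, the check I have in mind is roughly the following (the epsilon value here is just a placeholder - choosing it is exactly what I am asking about):
float[] values = {0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f}; // n floats that should sum to 1
float sum = 0f;
for (float v : values) {
    sum += v;
}
float epsilon = 1e-6f; // placeholder - what is a principled value?
System.out.println(Math.abs(sum - 1.0f) < epsilon);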
UPDATE: I understand why floats are imprecise and I know I can handle probabilities in a more precise way. I just want to understand a little more about floats in particular.
UPDATE 2: The floats DO sum to 1 if I consider them mathematically (e.g. 0.1 + ... + 0.1)
java floating-point floating-accuracy
What is your floats' precision?
– talex
Jan 2 at 8:13
Possible duplicate of Manipulating and comparing floating points in java
– mkjh
Jan 2 at 8:13
What is the purpose of your application? float and double are for scientific calculations; that is, they work with a limited number of significant digits and add/sum with respect to those significant digits. If you want precision, use int or long, or for bigger numbers use BigDecimal.
– Murat Güvenç
Jan 2 at 8:53
The other question asks how to solve a precision problem; I want to understand how imprecise the result can be, which is why I think it's a different question.
– PolyglotPiglet
Jan 2 at 9:39
@mkjh: This is not a duplicate of that problem. That question, depending on how it is interpreted, asks either for the error in a single multiplication (which is ½ ULP) or the error in general computations (which is not answerable as a general question; error may range from zero to infinity and NaN). This question poses a constrained answerable question.
– Eric Postpischil
Jan 2 at 12:35
2 Answers
Yes, it is possible to determine the maximum possible "error" made by using fixed-precision floating-point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference between the sum and 1.0f will still be 1.0f.
This shows that the maximum error introduced by finite-precision floats, for a sum that would add up to 1.0 when using infinite precision, is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual value of n.
As @EricPostpischil mentions in the comments, "the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½)."
As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1.)
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n is much lower, or you know about further constraints on your input numbers that you have not yet told us about, it may be possible to get a much more useful upper bound for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying that a value is "good enough" in general.
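For illustration, the quoted bound translates directly into Java. This is only a sketch of the formula itself (the method name is made up); as noted above, whether such a bound is usable as an epsilon depends entirely on your application:
// Worst-case bound (n−1)•½•ULP(½) for summing n floats in [0, 1].
static float worstCaseSumError(int n) {
    return (n - 1) * 0.5f * Math.ulp(0.5f);
}
For example, worstCaseSumError(33554433) evaluates to 1.0f, matching the output above.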
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s_0, s_1, s_2, …, s_(n−1), in that format, each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2^−5, its trailing one bit is 0.001•2^−5 = 2^−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
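For illustration only (this sketch is not part of the proof above): one possible Java rendering of the algorithm, assuming binary32 inputs that meet the conditions listed at the top of this answer. The helper names are invented for the example.
import java.util.Comparator;
import java.util.PriorityQueue;

class TrailingBitSum {
    // Exponent of the lowest set bit of x; e.g. for 1.011•2^-5 it returns -8.
    static int trailingOneExponent(float x) {
        int bits = Float.floatToRawIntBits(x);
        int mantissa = bits & 0x007FFFFF;
        int biasedExp = (bits >>> 23) & 0xFF;
        if (biasedExp == 0) { // subnormal: no implicit leading 1 bit
            return -149 + Integer.numberOfTrailingZeros(mantissa);
        }
        int significand = mantissa | 0x00800000; // restore the implicit leading 1
        return (biasedExp - 127 - 23) + Integer.numberOfTrailingZeros(significand);
    }

    // Repeatedly replaces the two elements with the lowest trailing one bits by their sum.
    static float sumByTrailingBit(float[] values) {
        PriorityQueue<Float> queue = new PriorityQueue<>(
                Comparator.comparingInt(TrailingBitSum::trailingOneExponent));
        for (float v : values) {
            if (v != 0f) { // zeros never change a sum, so drop them
                queue.add(v);
            }
        }
        while (queue.size() > 1) {
            float a = queue.poll();
            float b = queue.poll();
            queue.add(a + b); // exact under the conditions assumed in the proof
        }
        return queue.isEmpty() ? 0f : queue.poll();
    }
}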
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S_0 = s_0, S_1 = S_0 + s_1, S_2 = S_1 + s_2, …, S_(n−1) = S_(n−2) + s_(n−1). Here, the S_i represent exact mathematical sums. Let T_i be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each T_i is u·T_i. For this initial analysis, I will assume u·S_i adequately approximates u·T_i. Then a bound on the total error is u·S_i summed over i, excluding i=0 (since there is no error in setting S_0 to s_0; the first addition occurs with S_1), which I will write sum(u·S_i).
This equals u·sum(S_i). And sum(S_i) = S_1 + S_2 + … + S_(n−1) = (s_0 + s_1) + (s_0 + s_1 + s_2) + … = (n−1)•s_0 + (n−1)•s_1 + (n−2)•s_2 + … + 1•s_(n−1). Permitting any real values in [0, 1], this sum is maximized when s_0 is 0 and the remaining s_i are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of T_i with S_i).
Then sum(S_i) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½ • (1/(n−1) + 1) • (n−1), and our bound on the error is u • ½ • (1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the S_i stand in for the T_i.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147. However, the constraints of the problem necessitate that each s_i is exactly 2^−149, in which case T_(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent T_i is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 due to the limited precision), so the final result is 2^−125, meaning the error is 1 − 2^−125, which is much less than 2^147.
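For illustration (not part of the original derivation), the ascending-order summation described in this section is straightforward in Java; under the assumptions stated in this answer, the ½un bound then applies to its result:
import java.util.Arrays;

class SortedSum {
    // Sums a copy of the values in ascending order; for non-negative inputs,
    // ascending numeric order is ascending order of magnitude.
    static float sumAscending(float[] values) {
        float[] copy = values.clone();
        Arrays.sort(copy);
        float sum = 0f;
        for (float v : copy) {
            sum += v;
        }
        return sum;
    }
}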
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is at most ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, so the total error is bounded by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in u·S_i, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u - it uses the bottom of the interval, whereas the sorted-addition passage uses a proportion. Thus, the sorted-addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54003108%2fwhat-is-a-suitable-tolerance-for-expecting-n-floats-to-sum-to-1%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
edited Jan 2 at 13:59
answered Jan 2 at 8:45
HulkHulk
3,42812142
3,42812142
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbers
a
and b
, the computed sum of a+b
might be a
, and so the error is 100% of b
. But that is not what the question asks.– Eric Postpischil
Jan 2 at 13:59
Yes, given some positive numbers
a
and b
, the computed sum of a+b
might be a
, and so the error is 100% of b
. But that is not what the question asks.– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2147. However, the constraints of the problem necessitate that each si is exactly 2−149, in which case T224−1 is 2−125 (no error has yet occurred) but then each subsequent Ti is also 2−125 due to rounding (adding 2−149 to 2−125 yields 2−125 due to the precision), so the final result is 2−125, meaning the error is 1−2−125, which is much less than 2147.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uSi, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2147. However, the constraints of the problem necessitate that each si is exactly 2−149, in which case T224−1 is 2−125 (no error has yet occurred) but then each subsequent Ti is also 2−125 due to rounding (adding 2−149 to 2−125 yields 2−125 due to the precision), so the final result is 2−125, meaning the error is 1−2−125, which is much less than 2147.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uSi, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
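Here is a Java sketch of the algorithm above, assuming the trailingOneExponent helper from the earlier sketch; the method name is mine. Under the stated conditions (positive floats whose exact sum is 1), every addition it performs is exact.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch: repeatedly combine the two elements with the lowest trailing one bits.
    static float sumByTrailingOneBit(float[] values) {
        PriorityQueue<Float> q = new PriorityQueue<>(
                Comparator.comparingInt((Float v) -> trailingOneExponent(v)));
        for (float v : values) {
            if (v != 0f) {      // zeros cannot change the sum, so drop them
                q.add(v);
            }
        }
        if (q.isEmpty()) {
            return 0f;
        }
        while (q.size() > 1) {
            float a = q.poll();
            float b = q.poll();
            q.add(a + b);       // exact, by the argument above
        }
        return q.poll();        // exactly 1 for inputs meeting the stated conditions
    }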
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.
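For reference, the sequential ascending-order sum could look like this in Java (a minimal sketch; the method name is mine, and Arrays.sort gives ascending order, which equals ascending magnitude here because all values are positive):

    import java.util.Arrays;

    // Sketch: sort ascending, then accumulate; the running value plays the role of Ti.
    static float ascendingSum(float[] values) {
        float[] sorted = values.clone();
        Arrays.sort(sorted);
        float t = 0f;
        for (float v : sorted) {
            t += v;
        }
        return t;
    }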
A bound on the error in the addition that produces each Ti is uTi. For this initial, rough analysis, I will assume uSi adequately approximates uTi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … + Sn−1 = (s0 + s1) + (s0 + s1 + s2) + … = (n−1)•s0 + (n−1)•s1 + (n−2)•s2 + … + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 is 0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1) = ½n, and our bound on the error is u • ½n = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
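Applied to the original question, this suggests a tolerance of about ½un = ¼n𝜀. A hedged Java sketch (the method name is mine; it assumes the conditions stated at the top of this answer and sequential summation):

    // Sketch: accept sums within the derived worst-case bound (1/2)*u*n = (1/4)*n*ulp(1) of 1.
    static boolean sumsToOne(float sum, int n) {
        float tolerance = 0.25f * n * Math.ulp(1.0f);
        return Math.abs(sum - 1.0f) <= tolerance;
    }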
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147•𝜀. However, the constraints of the problem necessitate that each si is exactly 2^−149, in which case T(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent Ti is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 at this precision), so the final result is 2^−125, meaning the error is 1−2^−125, which is much less than 2^147•𝜀.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. In this problem the exact partial sums before the final addition are all less than 1, and the maximum rounding error for a result in (0, 1) is ½u, so the total error in the first n−2 additions is bounded by ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, and the total error is bounded by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted-addition derivation uses the round-off error algebraically, as uSi, while the arbitrary-ordering derivation uses the fact that every result in [½, 1) has a round-off error of at most ½u: it uses the bottom of the interval, whereas the sorted-addition passage uses a proportion of the running sum. The sorted-addition derivation could therefore be tightened. Additionally, its worst case has all the numbers equal, in which case sorting provides no benefit. In general, sorting will reduce the error.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
answered Jan 3 at 21:53
Eric Postpischil