What is a suitable tolerance for expecting n floats to sum to 1
I have n floats which should sum to 1, each of which can take any value between 0 and 1. I want to check that they sum to 1, but given the inaccuracies of floating-point numbers, checking that sum == 1 will not always work.
Therefore I want to check that Math.abs(sum - 1) < epsilon, but I am not sure what a good value of epsilon should be. Is it possible to determine the maximum inaccuracy?
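For concreteness, the check I have in mind is roughly the following (the epsilon value here is just a placeholder - choosing it is exactly what I am asking about):
float[] values = {0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f}; // n floats that should sum to 1
float sum = 0f;
for (float v : values) {
    sum += v;
}
float epsilon = 1e-6f; // placeholder - what is a principled value?
System.out.println(Math.abs(sum - 1.0f) < epsilon);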
UPDATE: I understand why floats are imprecise and I know I can handle probabilities in a more precise way. I just want to understand a little more about floats in particular.
UPDATE 2: The floats DO sum to 1 if I consider them mathematically (e.g. 0.1 + ... + 0.1)
java floating-point floating-accuracy
What is your floats' precision?
– talex
Jan 2 at 8:13
Possible duplicate of Manipulating and comparing floating points in java
– mkjh
Jan 2 at 8:13
What is the purpose of your application? float and double are for scientific calculations; that is, they work with a limited number of significant digits and add/sum with respect to those significant digits. If you want precision, use int or long, or for bigger numbers use BigDecimal.
– Murat Güvenç
Jan 2 at 8:53
The other question asks how to solve a precision problem; I want to understand how imprecise the result can be, which is why I think it's a different question.
– PolyglotPiglet
Jan 2 at 9:39
@mkjh: This is not a duplicate of that problem. That question, depending on how it is interpreted, asks either for the error in a single multiplication (which is ½ ULP) or the error in general computations (which is not answerable as a general question; error may range from zero to infinity and NaN). This question poses a constrained answerable question.
– Eric Postpischil
Jan 2 at 12:35
2 Answers
Yes, it is possible to determine the maximum possible "error" made by using fixed-precision floating-point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference between the sum and 1.0f will still be 1.0f.
This shows that the maximum error introduced by finite-precision floats, for a sum that would add up to 1.0 when using infinite precision, is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual value of n.
As @EricPostpischil mentions in the comments, "the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½)."
As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1.)
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n is much lower, or you know about further constraints on your input numbers that you have not yet told us about, it may be possible to get a much more useful upper bound for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying that a value is "good enough" in general.
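For illustration, the quoted bound translates directly into Java. This is only a sketch of the formula itself (the method name is made up); as noted above, whether such a bound is usable as an epsilon depends entirely on your application:
// Worst-case bound (n−1)•½•ULP(½) for summing n floats in [0, 1].
static float worstCaseSumError(int n) {
    return (n - 1) * 0.5f * Math.ulp(0.5f);
}
For example, worstCaseSumError(33554433) evaluates to 1.0f, matching the output above.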
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s_0, s_1, s_2, …, s_(n−1), in that format, each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2^−5, its trailing one bit is 0.001•2^−5 = 2^−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
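For illustration only (this sketch is not part of the proof above): one possible Java rendering of the algorithm, assuming binary32 inputs that meet the conditions listed at the top of this answer. The helper names are invented for the example.
import java.util.Comparator;
import java.util.PriorityQueue;

class TrailingBitSum {
    // Exponent of the lowest set bit of x; e.g. for 1.011•2^-5 it returns -8.
    static int trailingOneExponent(float x) {
        int bits = Float.floatToRawIntBits(x);
        int mantissa = bits & 0x007FFFFF;
        int biasedExp = (bits >>> 23) & 0xFF;
        if (biasedExp == 0) { // subnormal: no implicit leading 1 bit
            return -149 + Integer.numberOfTrailingZeros(mantissa);
        }
        int significand = mantissa | 0x00800000; // restore the implicit leading 1
        return (biasedExp - 127 - 23) + Integer.numberOfTrailingZeros(significand);
    }

    // Repeatedly replaces the two elements with the lowest trailing one bits by their sum.
    static float sumByTrailingBit(float[] values) {
        PriorityQueue<Float> queue = new PriorityQueue<>(
                Comparator.comparingInt(TrailingBitSum::trailingOneExponent));
        for (float v : values) {
            if (v != 0f) { // zeros never change a sum, so drop them
                queue.add(v);
            }
        }
        while (queue.size() > 1) {
            float a = queue.poll();
            float b = queue.poll();
            queue.add(a + b); // exact under the conditions assumed in the proof
        }
        return queue.isEmpty() ? 0f : queue.poll();
    }
}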
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S_0 = s_0, S_1 = S_0 + s_1, S_2 = S_1 + s_2, …, S_(n−1) = S_(n−2) + s_(n−1). Here, the S_i represent exact mathematical sums. Let T_i be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each T_i is u·T_i. For this initial analysis, I will assume u·S_i adequately approximates u·T_i. Then a bound on the total error is u·S_i summed over i, excluding i=0 (since there is no error in setting S_0 to s_0; the first addition occurs with S_1), which I will write sum(u·S_i).
This equals u·sum(S_i). And sum(S_i) = S_1 + S_2 + … + S_(n−1) = (s_0 + s_1) + (s_0 + s_1 + s_2) + … = (n−1)•s_0 + (n−1)•s_1 + (n−2)•s_2 + … + 1•s_(n−1). Permitting any real values in [0, 1], this sum is maximized when s_0 is 0 and the remaining s_i are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of T_i with S_i).
Then sum(S_i) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½ • (1/(n−1) + 1) • (n−1), and our bound on the error is u • ½ • (1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the S_i stand in for the T_i.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147. However, the constraints of the problem necessitate that each s_i is exactly 2^−149, in which case T_(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent T_i is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 due to the limited precision), so the final result is 2^−125, meaning the error is 1 − 2^−125, which is much less than 2^147.
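For illustration (not part of the original derivation), the ascending-order summation described in this section is straightforward in Java; under the assumptions stated in this answer, the ½un bound then applies to its result:
import java.util.Arrays;

class SortedSum {
    // Sums a copy of the values in ascending order; for non-negative inputs,
    // ascending numeric order is ascending order of magnitude.
    static float sumAscending(float[] values) {
        float[] copy = values.clone();
        Arrays.sort(copy);
        float sum = 0f;
        for (float v : copy) {
            sum += v;
        }
        return sum;
    }
}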
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is at most ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, so the total error is bounded by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in u·S_i, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u - it uses the bottom of the interval, whereas the sorted-addition passage uses a proportion. Thus, the sorted-addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54003108%2fwhat-is-a-suitable-tolerance-for-expecting-n-floats-to-sum-to-1%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.
However, the result of this analysis would not be a useful value for an epsilon.
Consider the following example:
float a = 1.0E-37f;
float b = Float.MIN_VALUE;
System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);
This prints:
true
1.0E-37
1.0
So, if you perform an arbitrary number of additions of Float.MIN_VALUE
to 1.0E-37f
, the difference of the sum to 1.0f
will still be 1.0f
.
This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0
when using infinite precision is actually 1.0f
- obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.
Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n
.
As @EricPostpischil mentions in the comments,
the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).
As you can see from this formula, it takes a large value of n
to come to an error of 1.0
. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).
Putting in some numbers:
int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));
Yields
1.0
If your n
is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.
My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.
edited Jan 2 at 13:59
answered Jan 2 at 8:45
HulkHulk
3,42812142
3,42812142
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbersa
andb
, the computed sum ofa+b
might bea
, and so the error is 100% ofb
. But that is not what the question asks.
– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.
– Eric Postpischil
Jan 2 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.
– Eric Postpischil
Jan 2 at 13:58
Yes, given some positive numbers
a
and b
, the computed sum of a+b
might be a
, and so the error is 100% of b
. But that is not what the question asks.– Eric Postpischil
Jan 2 at 13:59
Yes, given some positive numbers
a
and b
, the computed sum of a+b
might be a
, and so the error is 100% of b
. But that is not what the question asks.– Eric Postpischil
Jan 2 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".
– Hulk
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.
– Eric Postpischil
Jan 2 at 14:03
|
show 2 more comments
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2147. However, the constraints of the problem necessitate that each si is exactly 2−149, in which case T224−1 is 2−125 (no error has yet occurred) but then each subsequent Ti is also 2−125 due to rounding (adding 2−149 to 2−125 yields 2−125 due to the precision), so the final result is 2−125, meaning the error is 1−2−125, which is much less than 2147.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uSi, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.
A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2147. However, the constraints of the problem necessitate that each si is exactly 2−149, in which case T224−1 is 2−125 (no error has yet occurred) but then each subsequent Ti is also 2−125 due to rounding (adding 2−149 to 2−125 yields 2−125 due to the precision), so the final result is 2−125, meaning the error is 1−2−125, which is much less than 2147.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uSi, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
add a comment |
This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I use the conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.
Suppose we use this algorithm to sum the numbers:
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
- The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
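Here is a Java sketch of the algorithm above, assuming the trailingOneExponent helper from the earlier sketch; the method name is mine. Under the stated conditions (positive floats whose exact sum is 1), every addition it performs is exact.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch: repeatedly combine the two elements with the lowest trailing one bits.
    static float sumByTrailingOneBit(float[] values) {
        PriorityQueue<Float> q = new PriorityQueue<>(
                Comparator.comparingInt((Float v) -> trailingOneExponent(v)));
        for (float v : values) {
            if (v != 0f) {      // zeros cannot change the sum, so drop them
                q.add(v);
            }
        }
        if (q.isEmpty()) {
            return 0f;
        }
        while (q.size() > 1) {
            float a = q.poll();
            float b = q.poll();
            q.add(a + b);       // exact, by the argument above
        }
        return q.poll();        // exactly 1 for inputs meeting the stated conditions
    }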
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.
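For reference, the sequential ascending-order sum could look like this in Java (a minimal sketch; the method name is mine, and Arrays.sort gives ascending order, which equals ascending magnitude here because all values are positive):

    import java.util.Arrays;

    // Sketch: sort ascending, then accumulate; the running value plays the role of Ti.
    static float ascendingSum(float[] values) {
        float[] sorted = values.clone();
        Arrays.sort(sorted);
        float t = 0f;
        for (float v : sorted) {
            t += v;
        }
        return t;
    }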
A bound on the error in the addition that produces each Ti is uTi. For this initial, rough analysis, I will assume uSi adequately approximates uTi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).
This equals u sum(Si). And sum(Si) = S1 + S2 + … + Sn−1 = (s0 + s1) + (s0 + s1 + s2) + … = (n−1)•s0 + (n−1)•s1 + (n−2)•s2 + … + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 is 0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).
Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1) = ½n, and our bound on the error is u • ½n = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
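Applied to the original question, this suggests a tolerance of about ½un = ¼n𝜀. A hedged Java sketch (the method name is mine; it assumes the conditions stated at the top of this answer and sequential summation):

    // Sketch: accept sums within the derived worst-case bound (1/2)*u*n = (1/4)*n*ulp(1) of 1.
    static boolean sumsToOne(float sum, int n) {
        float tolerance = 0.25f * n * Math.ulp(1.0f);
        return Math.abs(sum - 1.0f) <= tolerance;
    }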
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147•𝜀. However, the constraints of the problem necessitate that each si is exactly 2^−149, in which case T(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent Ti is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 at this precision), so the final result is 2^−125, meaning the error is 1−2^−125, which is much less than 2^147•𝜀.
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. In this problem the exact partial sums before the final addition are all less than 1, and the maximum rounding error for a result in (0, 1) is ½u, so the total error in the first n−2 additions is bounded by ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, and the total error is bounded by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted-addition derivation uses the round-off error algebraically, as uSi, while the arbitrary-ordering derivation uses the fact that every result in [½, 1) has a round-off error of at most ½u: it uses the bottom of the interval, whereas the sorted-addition passage uses a proportion of the running sum. The sorted-addition derivation could therefore be tightened. Additionally, its worst case has all the numbers equal, in which case sorting provides no benefit. In general, sorting will reduce the error.
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.
answered Jan 3 at 21:53
Eric Postpischil