What is a suitable tolerance for expecting n floats to sum to 1



























I have n floats which should sum to 1, each of which can take any value between 0 and 1. I want to check that they sum to 1, but given the inaccuracies of floating-point numbers, checking that the sum == 1 will not always work.



Therefore I want to check that Math.abs(sum - 1) < epsilon, but I am not sure what a good value of epsilon should be. Is it possible to determine the maximum inaccuracy?



UPDATE: I understand why floats are imprecise and I know I can handle probabilities in a more precise way. I just want to understand a little more about floats in particular.



UPDATE 2: The floats DO sum to 1 if I consider them mathematically (e.g. 0.1 + ... + 0.1).
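
For concreteness, here is a minimal sketch of the check being described, in Java (the array contents and the EPSILON value are placeholders; choosing a defensible EPSILON is exactly what this question asks):

import java.util.Arrays;

public class SumCheck {

    // Placeholder tolerance - picking a justified value is the open question here.
    static final float EPSILON = 1e-6f;

    static boolean sumsToOne(float[] values) {
        float sum = 0f;
        for (float v : values) {
            sum += v;                  // each addition may round
        }
        return Math.abs(sum - 1f) < EPSILON;
    }

    public static void main(String[] args) {
        float[] tenths = new float[10];
        Arrays.fill(tenths, 0.1f);     // mathematically 10 * 0.1 = 1
        System.out.println(sumsToOne(tenths));
    }
}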



































  • What is your floats precision?

    – talex
    Jan 2 at 8:13






  • Possible duplicate of Manipulating and comparing floating points in java

    – mkjh
    Jan 2 at 8:13











  • What is the purpose of your application? float and double are for scientific calculations; that is, they count significant digits and add/sum with respect to those significant digits. If you want precision, use int/long, or for bigger numbers use BigDecimal

    – Murat Güvenç
    Jan 2 at 8:53











  • The other question asks how to solve a precision problem; I want to understand how imprecise the result can be, which is why I think it's a different question.

    – PolyglotPiglet
    Jan 2 at 9:39








  • @mkjh: This is not a duplicate of that problem. That question, depending on how it is interpreted, asks either for the error in a single multiplication (which is ½ ULP) or the error in general computations (which is not answerable as a general question; error may range from zero to infinity and NaN). This question poses a constrained, answerable question.

    – Eric Postpischil
    Jan 2 at 12:35
















java floating-point floating-accuracy






asked Jan 2 at 8:10 by PolyglotPiglet, edited Jan 2 at 15:47







2 Answers
































Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.



However, the result of this analysis would not be a useful value for an epsilon.



Consider the following example:



float a = 1.0E-37f;
float b = Float.MIN_VALUE;

System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);


This prints:



true
1.0E-37
1.0


So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference between the sum and 1.0f will still be 1.0f.



This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.





Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.



As @EricPostpischil mentions in the comments,




the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).




As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).



Putting in some numbers:



int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));


Yields



1.0


If your n is much lower, or you know about further constraints on your input numbers that you did not yet tell us about, it may be possible to get a much more useful upper bound for the error.
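
As a rough sketch of how that bound could be turned into an n-dependent tolerance (assuming n is known and the worst-case bound quoted above is acceptable; sumsToOneWithinBound is an illustrative name):

static boolean sumsToOneWithinBound(float[] values) {
    float sum = 0f;
    for (float v : values) {
        sum += v;
    }
    // (n - 1) additions, each off by at most 0.5 * ULP(0.5) in the worst case
    double epsilon = (values.length - 1) * 0.5 * Math.ulp(0.5f);
    return Math.abs(sum - 1.0) <= epsilon;
}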





My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.






answered Jan 2 at 8:45 by Hulk, edited Jan 2 at 13:59


























  • This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

    – Eric Postpischil
    Jan 2 at 12:43













  • First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

    – Eric Postpischil
    Jan 2 at 13:58













  • Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

    – Eric Postpischil
    Jan 2 at 13:59











  • @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

    – Hulk
    Jan 2 at 14:03













  • The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

    – Eric Postpischil
    Jan 2 at 14:03



































This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.



In this answer, I use the conditions:




  • Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.

  • We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.


(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)



Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)



Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)



Ordering by least bit is exact.



Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2^−5, its trailing one bit is 0.001•2^−5 = 2^−8.



Suppose we use this algorithm to sum the numbers:




  • Let S be a set containing the numbers in the sequence.

  • Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.

  • Replace a and b in S with the computed sum of a and b.

  • Repeat until S contains one element.


The number remaining in S is exactly 1, with no error.



Proof:




  • If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.

  • The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.

  • So all arithmetic operations in the algorithm are exact, so the final sum is exact.
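
As an illustration of the pairing strategy above, here is a sketch in Java (it assumes a non-empty array of positive floats, as the conditions require; trailingOneBit and exactSum are names introduced here for illustration only):

import java.util.Comparator;
import java.util.PriorityQueue;

public class PairwiseExactSum {

    // Position value of the least significant 1 bit of a positive float.
    static float trailingOneBit(float x) {
        float u = Math.ulp(x);       // x is an exact integer multiple of u for x > 0
        long q = (long) (x / u);     // exact quotient; it fits within the significand
        return u * Long.lowestOneBit(q);
    }

    // Repeatedly replace the two elements with the smallest trailing one bits by
    // their computed sum; per the argument above, every such addition is exact.
    static float exactSum(float[] values) {
        PriorityQueue<Float> queue =
                new PriorityQueue<Float>(Comparator.comparingDouble(PairwiseExactSum::trailingOneBit));
        for (float v : values) {
            queue.add(v);
        }
        while (queue.size() > 1) {
            queue.add(queue.poll() + queue.poll());
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        float[] values = {0.5f, 0.25f, 0.125f, 0.0625f, 0.0625f};  // exact sum is 1
        System.out.println(exactSum(values) == 1.0f);              // prints true
    }
}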


Sorting by magnitude



Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.



A bound on the error in the addition that produces each Ti is u•Ti. For this initial analysis, I will assume u•Si adequately approximates u•Ti. Then a bound on the total error is u•Si summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(u•Si).



This equals u•sum(Si). And sum(Si) = S1 + S2 + … + Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + … = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + … + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 is 0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).



Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound on the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.



Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.
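
To make that concrete, a small sketch (it sums in ascending order as in the passage above and evaluates the ½un figure for n = 1,000,000 in float; the fill value 1e-6f is an arbitrary illustration, not a value whose copies sum to exactly 1 mathematically):

import java.util.Arrays;

public class SortedSumBound {

    // Sum in ascending order, smallest addends first.
    static float sortedSum(float[] values) {
        float[] copy = values.clone();
        Arrays.sort(copy);
        float sum = 0f;
        for (float v : copy) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        double u = 0.5 * Math.ulp(1.0f);   // unit round-off for float, round-to-nearest
        System.out.println(0.5 * u * n);   // the (1/2)*u*n bound, about 0.0298

        float[] values = new float[n];
        Arrays.fill(values, 1e-6f);
        System.out.println(sortedSum(values));  // close to, but generally not exactly, 1
    }
}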



Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147 𝜀. However, the constraints of the problem necessitate that each si is exactly 2^−149, in which case T_(2^24−1) is 2^−125 (no error has yet occurred) but then each subsequent Ti is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 due to the precision), so the final result is 2^−125, meaning the error is 1 − 2^−125, which is much less than 2^147 𝜀.



Arbitrary ordering



Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, so the total error is bounded by ½u(n−2) + u = ½un.



Discussion



The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in u•Si, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u; it uses the bottom of the interval, whereas the sorted-addition passage uses a proportion. Thus, the sorted-addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.



The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.





























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54003108%2fwhat-is-a-suitable-tolerance-for-expecting-n-floats-to-sum-to-1%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.



    However, the result of this analysis would not be a useful value for an epsilon.



    Consider the following example:



    float a = 1.0E-37f;
    float b = Float.MIN_VALUE;

    System.out.println((a+b) == a);
    System.out.println(a);
    System.out.println(1.0f - a);


    This prints:



    true
    1.0E-37
    1.0


    So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference of the sum to 1.0f will still be 1.0f.



    This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.





    Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.



    As @EricPostpischil mentions in the comments,




    the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).




    As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).



    Putting in some numbers:



    int n = 33554433;
    System.out.println((n-1)*0.5*Math.ulp(0.5f));


    Yields



    1.0


    If your n is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.





    My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.






    share|improve this answer


























    • This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

      – Eric Postpischil
      Jan 2 at 12:43













    • First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

      – Eric Postpischil
      Jan 2 at 13:58













    • Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

      – Eric Postpischil
      Jan 2 at 13:59











    • @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

      – Hulk
      Jan 2 at 14:03













    • The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

      – Eric Postpischil
      Jan 2 at 14:03


















    1














    Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.



    However, the result of this analysis would not be a useful value for an epsilon.



    Consider the following example:



    float a = 1.0E-37f;
    float b = Float.MIN_VALUE;

    System.out.println((a+b) == a);
    System.out.println(a);
    System.out.println(1.0f - a);


    This prints:



    true
    1.0E-37
    1.0


    So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference of the sum to 1.0f will still be 1.0f.



    This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.





    Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.



    As @EricPostpischil mentions in the comments,




    the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).




    As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).



    Putting in some numbers:



    int n = 33554433;
    System.out.println((n-1)*0.5*Math.ulp(0.5f));


    Yields



    1.0


    If your n is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.





    My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.






    share|improve this answer


























    • This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

      – Eric Postpischil
      Jan 2 at 12:43













    • First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

      – Eric Postpischil
      Jan 2 at 13:58













    • Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

      – Eric Postpischil
      Jan 2 at 13:59











    • @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

      – Hulk
      Jan 2 at 14:03













    • The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

      – Eric Postpischil
      Jan 2 at 14:03
















    1












    1








    1







    Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.



    However, the result of this analysis would not be a useful value for an epsilon.



    Consider the following example:



    float a = 1.0E-37f;
    float b = Float.MIN_VALUE;

    System.out.println((a+b) == a);
    System.out.println(a);
    System.out.println(1.0f - a);


    This prints:



    true
    1.0E-37
    1.0


    So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference of the sum to 1.0f will still be 1.0f.



    This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.





    Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.



    As @EricPostpischil mentions in the comments,




    the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).




    As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).



    Putting in some numbers:



    int n = 33554433;
    System.out.println((n-1)*0.5*Math.ulp(0.5f));


    Yields



    1.0


    If your n is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.





    My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.






    share|improve this answer















    Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.



    However, the result of this analysis would not be a useful value for an epsilon.



    Consider the following example:



    float a = 1.0E-37f;
    float b = Float.MIN_VALUE;

    System.out.println((a+b) == a);
    System.out.println(a);
    System.out.println(1.0f - a);


    This prints:



    true
    1.0E-37
    1.0


    So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference of the sum to 1.0f will still be 1.0f.



    This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.





    Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.



    As @EricPostpischil mentions in the comments,




    the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).




    As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).



    Putting in some numbers:



    int n = 33554433;
    System.out.println((n-1)*0.5*Math.ulp(0.5f));


    Yields



    1.0


    If your n is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.





    My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 2 at 13:59

























    answered Jan 2 at 8:45









    HulkHulk

    3,42812142




    3,42812142













    • This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

      – Eric Postpischil
      Jan 2 at 12:43













    • First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

      – Eric Postpischil
      Jan 2 at 13:58













    • Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

      – Eric Postpischil
      Jan 2 at 13:59











    • @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

      – Hulk
      Jan 2 at 14:03













    • The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

      – Eric Postpischil
      Jan 2 at 14:03





















    • This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

      – Eric Postpischil
      Jan 2 at 12:43













    • First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

      – Eric Postpischil
      Jan 2 at 13:58













    • Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

      – Eric Postpischil
      Jan 2 at 13:59











    • @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

      – Hulk
      Jan 2 at 14:03













    • The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

      – Eric Postpischil
      Jan 2 at 14:03



















    This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

    – Eric Postpischil
    Jan 2 at 12:43







    This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition.

    – Eric Postpischil
    Jan 2 at 12:43















    First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

    – Eric Postpischil
    Jan 2 at 13:58







    First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect.

    – Eric Postpischil
    Jan 2 at 13:58















    Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

    – Eric Postpischil
    Jan 2 at 13:59





    Yes, given some positive numbers a and b, the computed sum of a+b might be a, and so the error is 100% of b. But that is not what the question asks.

    – Eric Postpischil
    Jan 2 at 13:59













    @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

    – Hulk
    Jan 2 at 14:03







    @EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general".

    – Hulk
    Jan 2 at 14:03















    The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

    – Eric Postpischil
    Jan 2 at 14:03







    The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift.

    – Eric Postpischil
    Jan 2 at 14:03















    0














    This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.



    In this answer, I use the conditions:




    • Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.

    • We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.


    (Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)



    Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)



    Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)



    Ordering by least bit is exact.



    Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.



    Suppose we use this algorithm to sum the numbers:




    • Let S be a set containing the numbers in the sequence.

    • Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.

    • Replace a and b in S with the computed sum of a and b.

    • Repeat until S contains one element.


    The number remaining in S is exactly 1, with no error.



    Proof:




    • If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.

    • The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.

    • So all arithmetic operations in the algorithm are exact, so the final sum is exact.


    Sorting by magnitude



    Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.



    A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).



    This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).



    Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.



    Thus, in adding a million numbers, we can expect a maximum error around 250,000 𝜀. (Again, we have approximated by assuming the Si stand in for the Ti.) Typical error would of course be much smaller.



    Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2147. However, the constraints of the problem necessitate that each si is exactly 2−149, in which case T224−1 is 2−125 (no error has yet occurred) but then each subsequent Ti is also 2−125 due to rounding (adding 2−149 to 2−125 yields 2−125 due to the precision), so the final result is 2−125, meaning the error is 1−2−125, which is much less than 2147.



    Arbitrary ordering



    Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.



    Discussion



    The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uSi, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.



    The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.






    share|improve this answer




























      0














      This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.



      In this answer, I use the conditions:




      • Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.

      • We have a sequence of numbers s0, s1, s2,… sn−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.


      (Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)



      Definition: 𝜀 is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)



      Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½𝜀. In a directed rounding mode, such as toward-infinity or toward-zero, u is 𝜀. (Currently, I am not otherwise considering directed rounding modes.)



      Ordering by least bit is exact.



      Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2−5, its trailing one bit is 0.001•2−5 = 2−8.



      Suppose we use this algorithm to sum the numbers:




      • Let S be a set containing the numbers in the sequence.

      • Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.

      • Replace a and b in S with the computed sum of a and b.

      • Repeat until S contains one element.


      The number remaining in S is exactly 1, with no error.



      Proof:




      • If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.

      • The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.

      • So all arithmetic operations in the algorithm are exact, so the final sum is exact.


      Sorting by magnitude



      Suppose the numbers are sorted in ascending order, and we add them sequentially: S0 = s0, S1 = S0 + s1, S2 = S1 + s2,… Sn−1 = Sn−2 + sn−1. Here, the Si represent exact mathematical sums. Let Ti be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.



      A bound on the error in the addition that produces each Ti is uTi. For this initial, analysis, I will assume uSi adequately approximates uSi. Then a bound on the total error is uSi summed over i, excluding i=0 (since there is no error in setting S0 to s0; the first addition occurs with S1), which I will write sum(uSi).



      This equals u sum(Si). And sum(Si) = S1 + S2 + … Sn−1 = (s0) + (s0 + s1) + (s0 + s1 + s2) + ... = (n−2)•s0 + (n−2)•s1 + (n−3)•s2 + 1•sn−1. Permitting any real values in [0, 1], this sum is maximized when s0 and the remaining si are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of Ti with Si).



      Then sum(Si) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound on the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.



      Thus, in adding a million numbers, this bound allows a maximum error around 250,000 𝜀. (Again, we have approximated by letting the Si stand in for the Ti.) Typical error would of course be much smaller.
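
      To get a feel for how far below this bound the typical error falls, one can compare a naive float accumulation against a much more precise double accumulation of the same values (an illustrative experiment; 1.0f/n is used only to generate n roughly equal terms):

      public class SumErrorDemo {
          public static void main(String[] args) {
              int n = 1_000_000;
              float term = 1.0f / n;
              float floatSum = 0f;
              double preciseSum = 0.0;  // stands in for the exact sum of the same float terms
              for (int i = 0; i < n; i++) {
                  floatSum += term;
                  preciseSum += term;   // double rounding error is negligible at this scale
              }
              double bound = 0.25 * n * Math.ulp(1.0f);  // ½un with u = ½𝜀
              System.out.println("observed error: " + Math.abs(floatSum - preciseSum));
              System.out.println("derived bound:  " + bound);
          }
      }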



      Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147 𝜀. However, the constraints of the problem necessitate that each si is exactly 2^−149, in which case T(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent Ti is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125, because 2^−149 is only half an ULP of 2^−125 and the tie rounds to even), so the final result is 2^−125, meaning the error is 1 − 2^−125, which is much less than 2^147 𝜀.
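
      The stagnation in that example is easy to observe directly (a tiny check using Java hexadecimal float literals):

      public class Stagnation {
          public static void main(String[] args) {
              float big = 0x1p-125f;   // 2^-125
              float tiny = 0x1p-149f;  // 2^-149, the least positive float
              // tiny is half an ULP of big, and the tie rounds to even, so the sum is unchanged.
              System.out.println(big + tiny == big);  // prints true
          }
      }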



      Arbitrary ordering



      Suppose we have no control over the order and must add the numbers in the order given. The maximum rounding error in an addition whose result lies in (0, 1) is ½u, and the first n−2 partial sums remain below 1, so the total error in the first n−2 additions is at most ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u rather than ½u, and the total error is bounded by ½u(n−2) + u = ½un.
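
      Applied to the original question, this ½un bound suggests a tolerance of ¼•n•𝜀 for the check, for example (a sketch; the bound is deliberately loose, so this errs on the side of accepting):

      public final class SumCheck {
          static boolean sumsToOne(float[] values) {
              float sum = 0f;
              for (float v : values) {
                  sum += v;
              }
              double tolerance = 0.25 * values.length * Math.ulp(1.0f);  // ½un = ¼n𝜀
              return Math.abs(sum - 1.0) <= tolerance;
          }
      }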



      Discussion



      The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted-addition analysis applies the round-off error u proportionally, as uSi, while the arbitrary-ordering analysis exploits the fact that every result in [½, 1) has a round-off error of at most ½u; it uses the bottom of that interval, where the sorted analysis uses a proportion. Thus, the sorted-addition derivation could be improved. Additionally, its worst case has nearly all the numbers equal, so sorting confers no benefit there. In general, though, sorting will reduce the error.



      The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.






        answered Jan 3 at 21:53









        Eric Postpischil
