What would be a good mathematical model to measure the degree of homogeneity of a mixture?
At my current workplace, we are looking to quantify a batch to say how "similar/dissimilar" the items are. The problem can be stated like so (transformed for public posting):



We have parts that can be assembled from different "buckets". Each bucket can contain pieces of the same type in different colors, e.g., colored triangular or circular pieces, cubes, or cylinders.



A particular part is assembled by picking pieces from each bucket. For simplicity, we may assume that pieces are picked from all buckets. It looks something like this:



(Image: matrix representation of the batch; rows are parts, columns are buckets.)



Problem: Compute how similar (or dissimilar) the parts in a batch are. The image above shows a batch of 5 parts. The individual values would be categorical variables, like red, green, and blue in this example.



Explanation: Similarity is something we can define with regard to color. So if all parts have the same "row of data", $a-p-m-e-i$ for Part1 above, we say the homogeneity is 100% (or heterogeneity is 0%). And if each part is made by picking a unique piece from the buckets, we say homogeneity is 0% (or heterogeneity is 100%). Everything else is somewhere in between, and that is the measure I'm trying to come up with for a particular batch.



Current Idea: Treat this like a vector problem: we have two vectors representing the 0%-homogeneity and 100%-homogeneity points. Given a batch, we compute another vector $V$ and see how close it is to the 0%-homogeneity vector and how far it is from the 100%-homogeneity vector (i.e., imagine a point placed on a line segment between two endpoints). We only need a metric for homogeneity within a particular batch. Would this be a mathematically sound way of computing similarity? Are there alternate ways? Existing references?



Extension: The above won't work if we have parts that are assembled from only a subset of the buckets in a batch. What modification could be made to allow for this scenario?



Update: A simple "count" of each element in a column should provide a number for how many different colors are used. So $n\cdot a$ implies only $a$ is used, but $\frac{n}{3}a+\frac{n}{3}b+\frac{n}{3}c$ would be the "ideal" heterogeneity vector, i.e., since it's distributed across three values.
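As a concrete sketch of this column-count idea, here is a short Python snippet. The batch below is hypothetical illustrative data mirroring the matrix representation (rows are parts, columns are buckets):

```python
from collections import Counter

# Hypothetical 5-part batch: rows are parts, columns are buckets.
batch = [
    ["a", "p", "m", "e", "i"],
    ["a", "p", "m", "e", "i"],
    ["b", "p", "m", "f", "i"],
    ["a", "q", "m", "e", "j"],
    ["a", "p", "n", "e", "i"],
]

# Count color occurrences per column (bucket).
column_counts = [Counter(row[j] for row in batch) for j in range(len(batch[0]))]

# Number of distinct colors used in each bucket.
distinct_per_column = [len(c) for c in column_counts]
```

A fully homogeneous batch would give `1` for every column; larger counts indicate more color variety in that bucket.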
  • Is the sum of any row constant? Can we think of the entries as being percentages (i.e., the percent of part $i$ built from bucket $j$)? If not, how should the rows a-p-m-e-i and 2a-2p-2m-2e-2i compare in terms of your intuitive idea of homogeneity? – Omnomnomnom, Jan 9 at 22:36

  • Is there a concrete goal of homogeneity, i.e., a condition/outcome that we're trying to optimize? – Omnomnomnom, Jan 9 at 22:38

  • For the sum: unfortunately, no. We can't rely on that. These "values" are mapped to a "count" and are not really numeric, so to speak; they're more like categorical variables. Goal: we completely lack this measure, yet we seem to talk about it a lot, and everyone seems to use their intuition to understand what it means; I was hoping to provide a more concrete metric. – PhD, Jan 9 at 23:23

  • @Omnomnomnom - See update. We are not going to be optimizing for anything with the value, though. Everyone talks about it and probably understands what they mean, but no one in the industry has a metric to ground this more rigorously. I was curious if there is something I could leverage from MathLand :) – PhD, Jan 9 at 23:29

  • Regarding your extension, one idea is to include "N/A" as a color, which is to say that the bucket is not used for the part. – Omnomnomnom, Jan 9 at 23:43
combinatorics matrices vectors mathematical-modeling
edited Jan 9 at 23:42 by Omnomnomnom
asked Jan 9 at 22:08 by PhD
1 Answer
One possible definition of homogeneity is to test how unlikely it is that the results would have been generated by sampling from a discrete uniform distribution for each bucket, for each part:



Let $b \in B$ be a given bucket, and let $C_b$ be the set of colors available for components from that bucket (including "N/A", per a suggestion in the comments). If we define $X_{ib}$ as the selected color for part $i$ from bucket $b$, then
$$X_{ib} \sim \text{DiscreteUniform}(C_b) \implies P(X_{ib} = c) = \frac{1}{|C_b|} \quad \text{for each } c \in C_b$$



For any given part $i$, we have the vector $X_i := (X_{ib})_{b\in B}$ that records the color choices from each bucket.



Our null hypothesis $H_0$ is that the parts are constructed by selecting at random from each bucket for each part according to the Discrete Uniform distribution for that bucket.



If we have $N$ parts, then the distribution of the colors selected for a given bucket $b$ across all parts (i.e., the "column" distribution) follows a multinomial distribution.



What we want to test is whether the observed distribution of colors among parts, across all buckets, is consistent with the null hypothesis. We can make "expectation" concrete by noting that the expected number of times a particular color $c$ is chosen from bucket $b$ (i.e., $e_{bc}$) is $\frac{N}{|C_b|}$. This gives us the expected number of times each bucket-color combination should occur among our $N$ parts (e.g., red-cylinder). The observed number of times a given bucket-color combination occurs is $O_{bc}$, where



$$O_{bc} = \sum_{i=1}^{N} \mathbf{1}_{c}(X_{ib})$$



Similar to a chi-square goodness of fit test, we can quantify the discrepancy of the observations from our expectations using a deviation statistic $d_{bc}$. For example, $d_{bc} = |e_{bc} - O_{bc}|$. The total deviation $d$ can be the sum of the deviations for each bucket-color combination:



$$ d = \sum_{b \in B}\sum_{c \in C_b} d_{bc} \quad \text{where} \quad d_{bc} = \left|\frac{N}{|C_b|} - O_{bc}\right| $$



The tricky part is determining the probability of different values of $d$ under our null hypothesis. I don't know if there is a nice mathematical formula, but you can get this computationally (to a high degree of accuracy) using simulation. The following pseudocode will help you approximate the null distribution of $d$.



d <- zero-vector with number_of_runs components
for r in 1...number_of_runs {
    for p in 1...number_of_parts {
        for b in 1...number_of_buckets {
            select a color from C_b (uniformly)
            assign that color to X_pb
        }
    }
    calculate discrepancy d_r
    d[r] <- d_r
}


Now that you have that, we can define the "homogeneity" of your actual sample as $1$ minus the p-value of the test of whether the color assignments were drawn uniformly (max heterogeneity). If we let $\hat{d}$ be the observed total discrepancy of our sample:



$$\text{Homogeneity} = 1 - P_{H_0}(d > \hat{d}) = P_{H_0}(d \leq \hat{d})$$
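The whole procedure can be fleshed out into a rough, runnable Python sketch. The batch representation (a list of rows, one color label per bucket) and the function names `total_deviation` and `homogeneity` are my own choices, not part of the original description:

```python
import random
from collections import Counter

def total_deviation(batch, colors):
    """Total L1 deviation d of the observed bucket-color counts O_bc
    from the uniform expectation N/|C_b|."""
    n = len(batch)
    d = 0.0
    for b, c_b in enumerate(colors):
        counts = Counter(part[b] for part in batch)
        expected = n / len(c_b)
        d += sum(abs(expected - counts.get(c, 0)) for c in c_b)
    return d

def homogeneity(batch, colors, runs=2000, seed=0):
    """Monte Carlo estimate of P_{H0}(d <= d_hat): simulate batches under
    the uniform null and count how often their deviation is at most the
    observed one."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(batch)
    d_hat = total_deviation(batch, colors)
    hits = sum(
        total_deviation(
            [[rng.choice(c_b) for c_b in colors] for _ in range(n)], colors
        ) <= d_hat
        for _ in range(runs)
    )
    return hits / runs

# Two buckets, each offering three colors.
colors = [["red", "green", "blue"], ["red", "green", "blue"]]
# Every part identical: maximally homogeneous.
same_batch = [["red", "green"]] * 6
# Colors spread perfectly evenly across parts: maximally heterogeneous.
uniform_batch = [["red", "green"], ["green", "blue"], ["blue", "red"]] * 2
```

Here `homogeneity(same_batch, colors)` evaluates to exactly 1.0, since a fully identical batch attains the maximum possible deviation, so every simulated batch satisfies $d \leq \hat{d}$; the evenly spread `uniform_batch` has $\hat{d} = 0$ and scores near 0.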



This has the property of being between $\eta$ and $1$, where $\eta = P_{H_0}(d = d_{\text{min}})$, with $\eta$ indicating maximum heterogeneity and $1$ being maximum homogeneity. Of course, you can translate by $\eta$ and scale by $1-\eta$ to get it back to a normalized scale of $0$ to $1$, but the un-scaled version will allow one to measure the absolute heterogeneity of samples [in some sense]. Larger numbers of parts, buckets, and/or colors allow greater heterogeneity, such that $\eta \to 0$ as the number of choices/parts increases.






share|cite|improve this answer











$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "69"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3068002%2fwhat-would-be-a-good-mathematical-model-to-measure-the-degree-of-homogeneity-of%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1












    $begingroup$

    One possible definition of homogeneity is to test how unlikely is is that the results would have been generated by sampling from a discrete uniform distribution for each bucket for each part:



    Let $b in B$ be a given bucket, and $C_b$ be the set of colors available for components from that bucket (including $N/A$ per a suggestion in the comments). If we define $X_{ib}$ as the selected color for the part $i$ from bucket $b$, then
    $$P(X_{ib} = c in C_b)sim text{DiscreteUniform(C_b)} implies P(X_{ib} = c in C_b) = frac{1}{|C_b|}$$



    For any given part $i$, we have the vector $X_i := (X_{ib})_{bin B}$ that records the color choices from each bucket.



    Our null hypothesis $H_0$ is that the parts are constructed by selecting at random from each bucket for each part according to the Discrete Uniform distribution for that bucket.



    If we have $N$ parts, then the distribution of the colors selected for a given bucket $b$ across all parts (i.e., the "column" distribution) will be a multinomial distribution.



    What we want to test is if the observed distribution of colors among parts for all buckets is consistent with the null hypothesis. We can represent "expectation" by noting that the expected number of times a particular color $c$ is chosen from bucket $b$ (i.e., $e_{cb}$) is $frac{N}{|C_b|}$. This will give us the expected number of times each bucket-color combination should occur among our $N$ parts (e.g., red-cylinder). The observed number of times a given bucket-color combination occurs is $O_{bc}$, where



    $$O_{bc} = sum_{i}^{N} mathbf{1}_{c}(X_{ib})$$



    Similar to a chi-square goodness of fit test, we can quantify the discrepancy of the observations from our expectations using a deviation statistic $d_{bc}$. For example, $d_{bc} = |e_{bc} - O_{bc}|$. The total deviation $d$ can be the sum of the deviations for each bucket-color combination:



    $$ d =sum_{b in B}sum_{c in C_b} d_{bc};; text{where} ; d_{bc} = left|frac{N}{|C_b| }- O_{bc}right|$$



    The tricky part is determining the probability of different values of $d$ under our null hypothesis. I don't know if there is a nice mathematical formula, but you can get this computationally (to a high degree of accuracy) using simulation. The following pseudocode will help you approximate the null distribution of $d$.



    d <- zero-vector with number_of_runs components
    for r in 1...number_of_runs{
    for p in 1....number_of_parts{
    for b in 1...number_of_buckets{
    select a color from C_b (uniformly)
    assign that color to X_pb
    }
    }
    calculate discrepancy d_r
    d[r] <- d_r
    }


    Now that you have that, we can define the "homogeneity" of your actual sample as $1$ minus the p-value of the test of whether the color assignments were drawn uniformly (max heterogeneity). If we let $hat{d}$ be the observed total discrepancy of our sample:



    $$text{Homogeneity} = 1 - P_{H_0}(d > hat{d}) = P_{H_0}(d leq hat{d})$$



    This has the property of being between $eta$ and $1$, where $eta = P_{H_0}(d = d_{text{min}})$ with $eta$ indicating maximum heterogeneity and $1$ being max homogeneity. Of course, you can translate by $eta$ and scale by $1-eta$ to get it back to a normalized scale of $0$ to $1$, but the un-scaled version will allow one to measure the absolute heterogeneity of samples [in some sense]. Larger numbers of parts, buckets, and/or colors allow greater heterogeneity such that $eta to 0$ as the number of choices/parts increases.






    share|cite|improve this answer











    $endgroup$


















      1












      $begingroup$

      One possible definition of homogeneity is to test how unlikely is is that the results would have been generated by sampling from a discrete uniform distribution for each bucket for each part:



      Let $b in B$ be a given bucket, and $C_b$ be the set of colors available for components from that bucket (including $N/A$ per a suggestion in the comments). If we define $X_{ib}$ as the selected color for the part $i$ from bucket $b$, then
      $$P(X_{ib} = c in C_b)sim text{DiscreteUniform(C_b)} implies P(X_{ib} = c in C_b) = frac{1}{|C_b|}$$



      For any given part $i$, we have the vector $X_i := (X_{ib})_{bin B}$ that records the color choices from each bucket.



      Our null hypothesis $H_0$ is that the parts are constructed by selecting at random from each bucket for each part according to the Discrete Uniform distribution for that bucket.



      If we have $N$ parts, then the distribution of the colors selected for a given bucket $b$ across all parts (i.e., the "column" distribution) will be a multinomial distribution.



      What we want to test is if the observed distribution of colors among parts for all buckets is consistent with the null hypothesis. We can represent "expectation" by noting that the expected number of times a particular color $c$ is chosen from bucket $b$ (i.e., $e_{cb}$) is $frac{N}{|C_b|}$. This will give us the expected number of times each bucket-color combination should occur among our $N$ parts (e.g., red-cylinder). The observed number of times a given bucket-color combination occurs is $O_{bc}$, where



      $$O_{bc} = sum_{i}^{N} mathbf{1}_{c}(X_{ib})$$



        One possible definition of homogeneity is to test how unlikely it is that the results would have been generated by sampling from a discrete uniform distribution, for each bucket, for each part:



        Let $b \in B$ be a given bucket, and $C_b$ be the set of colors available for components from that bucket (including $N/A$, per a suggestion in the comments). If we define $X_{ib}$ as the selected color for part $i$ from bucket $b$, then
        $$X_{ib} \sim \text{DiscreteUniform}(C_b) \implies P(X_{ib} = c) = \frac{1}{|C_b|} \quad \text{for each } c \in C_b$$



        For any given part $i$, we have the vector $X_i := (X_{ib})_{b \in B}$ that records the color choices from each bucket.



        Our null hypothesis $H_0$ is that the parts are constructed by selecting at random from each bucket for each part according to the Discrete Uniform distribution for that bucket.



        If we have $N$ parts, then the distribution of the colors selected for a given bucket $b$ across all parts (i.e., the "column" distribution) will be a multinomial distribution.



        What we want to test is whether the observed distribution of colors among parts, across all buckets, is consistent with the null hypothesis. Under $H_0$, the expected number of times a particular color $c$ is chosen from bucket $b$ (i.e., $e_{bc}$) is $\frac{N}{|C_b|}$. This gives us the expected count for each bucket-color combination among our $N$ parts (e.g., red-cylinder). The observed count for a given bucket-color combination is $O_{bc}$, where



        $$O_{bc} = \sum_{i=1}^{N} \mathbf{1}_{c}(X_{ib})$$



        Similar to a chi-square goodness-of-fit test, we can quantify the discrepancy of the observations from our expectations using a deviation statistic $d_{bc}$; for example, $d_{bc} = |e_{bc} - O_{bc}|$. The total deviation $d$ can be the sum of the deviations over all bucket-color combinations:



        $$ d = \sum_{b \in B} \sum_{c \in C_b} d_{bc} \;\; \text{where} \;\; d_{bc} = \left|\frac{N}{|C_b|} - O_{bc}\right|$$
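        As a concrete illustration, the observed total discrepancy can be computed directly from a batch's bucket-color counts. A minimal Python sketch (the batch data and bucket palettes below are made up for illustration):

```python
from collections import Counter

# A hypothetical batch: 5 parts, each row gives the color chosen
# from each of 3 buckets (column index = bucket index).
batch = [
    ["red",  "green", "blue"],
    ["red",  "green", "red"],
    ["red",  "blue",  "blue"],
    ["blue", "green", "blue"],
    ["red",  "green", "blue"],
]
# Colors available in each bucket (C_b); here every bucket offers 3 colors.
colors_per_bucket = [{"red", "green", "blue"}] * 3

def total_discrepancy(batch, colors_per_bucket):
    """d = sum over buckets b and colors c of |N/|C_b| - O_bc|."""
    d = 0.0
    for b, colors in enumerate(colors_per_bucket):
        observed = Counter(part[b] for part in batch)  # O_bc per color
        expected = len(batch) / len(colors)            # e_bc = N / |C_b|
        d += sum(abs(expected - observed[c]) for c in colors)
    return d

d_hat = total_discrepancy(batch, colors_per_bucket)
```

        For this batch, each bucket is dominated by one color (4 of 5 parts agree), so $\hat{d}$ comes out fairly large relative to the uniform expectation.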



        The tricky part is determining the probability of different values of $d$ under our null hypothesis. I don't know if there is a nice closed-form formula, but you can approximate it computationally (to a high degree of accuracy) using simulation. The following pseudocode approximates the null distribution of $d$:



        d <- zero-vector with number_of_runs components
        for r in 1...number_of_runs {
            for p in 1...number_of_parts {
                for b in 1...number_of_buckets {
                    select a color uniformly at random from C_b
                    assign that color to X_pb
                }
            }
            calculate the discrepancy d_r for this simulated batch
            d[r] <- d_r
        }
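        The pseudocode above can be fleshed out in Python. A self-contained sketch, with the discrepancy helper redefined so the snippet stands alone (batch shape, palette, and function names are illustrative, not from the original answer):

```python
import random
from collections import Counter

def total_discrepancy(batch, colors_per_bucket):
    """d = sum over buckets b and colors c of |N/|C_b| - O_bc|."""
    d = 0.0
    for b, colors in enumerate(colors_per_bucket):
        observed = Counter(part[b] for part in batch)
        expected = len(batch) / len(colors)
        d += sum(abs(expected - observed[c]) for c in colors)
    return d

def simulate_null(n_parts, colors_per_bucket, n_runs=5000, seed=42):
    """Approximate the null distribution of d: each simulated batch picks
    every part's color uniformly at random from its bucket's palette."""
    rng = random.Random(seed)
    palettes = [sorted(colors) for colors in colors_per_bucket]
    draws = []
    for _ in range(n_runs):
        batch = [[rng.choice(p) for p in palettes] for _ in range(n_parts)]
        draws.append(total_discrepancy(batch, colors_per_bucket))
    return draws

colors_per_bucket = [{"red", "green", "blue"}] * 3
null_d = simulate_null(n_parts=5, colors_per_bucket=colors_per_bucket)
```

        With the simulated draws in hand, the homogeneity of an observed batch with discrepancy $\hat{d}$ is estimated as the fraction of simulated $d$ values that are at most $\hat{d}$ (the empirical CDF at $\hat{d}$).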


        Now that we have the simulated null distribution, we can define the "homogeneity" of the actual sample as $1$ minus the p-value of the test of whether the color assignments were drawn uniformly (maximum heterogeneity). If we let $\hat{d}$ be the observed total discrepancy of our sample:



        $$\text{Homogeneity} = 1 - P_{H_0}(d > \hat{d}) = P_{H_0}(d \leq \hat{d})$$



        This has the property of being between $\eta$ and $1$, where $\eta = P_{H_0}(d = d_{\text{min}})$, with $\eta$ indicating maximum heterogeneity and $1$ being maximum homogeneity. Of course, you can translate by $\eta$ and scale by $1-\eta$ to get it back to a normalized scale of $0$ to $1$, but the un-scaled version will allow one to measure the absolute heterogeneity of samples [in some sense]. Larger numbers of parts, buckets, and/or colors allow greater heterogeneity, such that $\eta \to 0$ as the number of choices/parts increases.
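        The rescaling just described is a one-line affine transform. A sketch (the $\eta$ and homogeneity values are illustrative; in practice $\eta$ would be estimated from the simulated null distribution as the probability of the minimal discrepancy):

```python
def rescale(homogeneity, eta):
    """Map homogeneity from [eta, 1] onto [0, 1] by translating by eta
    and scaling by 1 - eta."""
    return (homogeneity - eta) / (1.0 - eta)

# Illustrative values: eta = 0.15, observed homogeneity = 0.83.
scaled = rescale(0.83, 0.15)
```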







        edited Jan 13 at 3:16

























        answered Jan 13 at 2:53









        Bey





























