What would be a good mathematical model to measure the degree of homogeneity of a mixture?
At my current workplace, we are looking to quantify a batch to say how "similar/dissimilar" the items are. The problem can be stated like so (transformed for public posting):
We have parts that can be assembled from different "buckets". Each bucket holds pieces of a single type (e.g., triangular or circular pieces, cubes, or cylinders) that come in different colors.
A particular part is assembled by picking pieces from each bucket. For simplicity, we may assume that pieces are picked from all buckets. It looks something like this:
Problem: Compute how similar (or dissimilar) the parts in a batch are. The image above shows a batch of 5 parts. The individual values are categorical variables, like red, green, and blue in this example.
Explanation: Similarity is something we can define with regard to color. If all parts have the same "row of data" ($a-p-m-e-i$ for Part 1 above), we say the homogeneity is 100% (heterogeneity 0%). If every part is made by picking a unique piece from each bucket, homogeneity is 0% (heterogeneity 100%). Everything else is somewhere in between, and that in-between value is the measure I'm trying to come up with for a particular batch.
Current Idea: Treat this as a vector problem. We have two vectors representing the 0%-homogeneity and 100%-homogeneity points. Given a batch, we compute another vector $V$ and see how close it is to the 0%-homogeneity vector and how far it is from the 100%-homogeneity vector (i.e., imagine a point placed on a line segment between two endpoints). We only need a metric for the homogeneity of a particular batch. Would this be a mathematically sound way of computing similarity? Are there alternative approaches? Existing references?
Extension: The above won't work if some parts are assembled from only a subset of the buckets in a batch. What modification would allow for this scenario?
Update: A simple "count" of each element in a column should indicate how many different colors are used. So $n\cdot a$ implies only $a$ is used, but $\frac{n}{3}a+\frac{n}{3}b+\frac{n}{3}c$ would be the "ideal" heterogeneity vector, since it is distributed evenly across three values.
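To make the counting idea concrete, here is a minimal Python sketch (the list-of-lists batch layout and the helper name column_color_counts are my own illustration, not an established API):
from collections import Counter

def column_color_counts(batch):
    """Count how often each color appears in each bucket ("column").
    batch: list of parts, each a list of colors, one per bucket."""
    return [Counter(column) for column in zip(*batch)]

# Example: a batch of 3 parts drawn from 2 buckets
batch = [["a", "p"], ["a", "q"], ["b", "p"]]
print(column_color_counts(batch))
# [Counter({'a': 2, 'b': 1}), Counter({'p': 2, 'q': 1})]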
combinatorics matrices vectors mathematical-modeling
edited Jan 9 at 23:42 by Omnomnomnom · asked Jan 9 at 22:08 by PhD
Is the sum of any row constant? Can we think of the entries as percentages (i.e., the percent of part $i$ built from bucket $j$)? If not, how should the rows $a-p-m-e-i$ and $2a-2p-2m-2e-2i$ compare in terms of your intuitive idea of homogeneity?
– Omnomnomnom, Jan 9 at 22:36
Is there a concrete goal of homogeneity, i.e., a condition/outcome that we're trying to optimize?
– Omnomnomnom, Jan 9 at 22:38
For the sum: unfortunately no, we can't rely on that. These "values" are mapped to a "count" and are not really numeric, so to speak; they're more like categorical variables. Goal: we completely lack this measure, yet we talk about it a lot and everyone seems to rely on intuition for what it means, so I was hoping to provide a more concrete metric.
– PhD, Jan 9 at 23:23
@Omnomnomnom - See update. We are not going to be optimizing for anything with the value, though. Everyone talks about it and they probably understand what they mean, but no one in the industry has a metric to ground this more rigorously. I was curious whether there is something I could leverage from MathLand :)
– PhD, Jan 9 at 23:29
Regarding your extension, one idea is to include "N/A" as a color, which is to say that the bucket is not used for the part.
– Omnomnomnom, Jan 9 at 23:43
1 Answer
One possible definition of homogeneity is to test how unlikely it is that the results would have been generated by sampling from a discrete uniform distribution over each bucket's colors, independently for each part:
Let $b \in B$ be a given bucket, and let $C_b$ be the set of colors available for components from that bucket (including "N/A", per a suggestion in the comments). If we define $X_{ib}$ as the selected color for part $i$ from bucket $b$, then
$$X_{ib} \sim \text{DiscreteUniform}(C_b) \implies P(X_{ib} = c) = \frac{1}{|C_b|} \quad \text{for each } c \in C_b$$
For any given part $i$, we have the vector $X_i := (X_{ib})_{b \in B}$ that records the color choices from each bucket.
Our null hypothesis $H_0$ is that the parts are constructed by selecting a color at random from each bucket for each part, according to that bucket's discrete uniform distribution.
If we have $N$ parts, then the distribution of colors selected from a given bucket $b$ across all parts (i.e., the "column" distribution) follows a multinomial distribution.
What we want to test is whether the observed distribution of colors across all buckets is consistent with the null hypothesis. We can formalize "expectation" by noting that the expected number of times a particular color $c$ is chosen from bucket $b$, call it $e_{bc}$, is $\frac{N}{|C_b|}$. This gives the expected number of times each bucket-color combination (e.g., red-cylinder) should occur among our $N$ parts. The observed number of times a given bucket-color combination occurs is $O_{bc}$, where
$$O_{bc} = \sum_{i=1}^{N} \mathbf{1}_{c}(X_{ib})$$
As in a chi-square goodness-of-fit test, we can quantify the discrepancy between observations and expectations using a deviation statistic $d_{bc}$, for example $d_{bc} = |e_{bc} - O_{bc}|$. The total deviation $d$ is then the sum of the deviations over all bucket-color combinations:
$$d = \sum_{b \in B} \sum_{c \in C_b} d_{bc} \quad \text{where} \quad d_{bc} = \left|\frac{N}{|C_b|} - O_{bc}\right|$$
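For concreteness, here is a minimal Python sketch of this total deviation (the data layout, buckets as a dict of color lists and parts as bucket-to-color dicts, is my own illustration):
def observed_d(batch, buckets):
    """Total deviation (d-hat) of an observed batch from uniform expectations.

    batch:   list of parts, each a dict mapping bucket -> chosen color
    buckets: dict mapping bucket -> list of available colors (C_b)
    """
    n = len(batch)
    d_hat = 0.0
    for b, colors in buckets.items():
        for c in colors:
            o_bc = sum(1 for part in batch if part.get(b) == c)  # O_bc
            d_hat += abs(n / len(colors) - o_bc)                 # |e_bc - O_bc|
    return d_hat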
The tricky part is determining the probability of different values of $d$ under the null hypothesis. I don't know whether there is a nice closed-form expression, but you can approximate it computationally (to a high degree of accuracy) using simulation. The following sketch, reusing the observed_d helper above, approximates the null distribution of $d$:
import random

def null_distribution_d(buckets, n_parts, n_runs):
    """Approximate the null distribution of the total deviation d."""
    d = []
    for _ in range(n_runs):
        # Build one simulated batch: each part draws a color uniformly
        # at random from every bucket (the null model).
        batch = [{b: random.choice(colors) for b, colors in buckets.items()}
                 for _ in range(n_parts)]
        d.append(observed_d(batch, buckets))  # discrepancy d_r for this run
    return d
With that distribution in hand, we can define the "homogeneity" of the actual sample as $1$ minus the p-value of the test that the color assignments were drawn uniformly (maximum heterogeneity). If we let $\hat{d}$ be the observed total discrepancy of our sample:
$$\text{Homogeneity} = 1 - P_{H_0}(d > \hat{d}) = P_{H_0}(d \leq \hat{d})$$
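Putting the pieces together, a Monte Carlo estimate of this probability, using the two sketches above, is simply the fraction of simulated deviations that do not exceed the observed one:
def homogeneity(batch, buckets, n_runs=10_000):
    """Estimate Homogeneity = P_H0(d <= d_hat) by simulation."""
    d_hat = observed_d(batch, buckets)
    d_null = null_distribution_d(buckets, len(batch), n_runs)
    return sum(d_r <= d_hat for d_r in d_null) / n_runs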
This has the property of being between $\eta$ and $1$, where $\eta = P_{H_0}(d = d_{\text{min}})$, with $\eta$ indicating maximum heterogeneity and $1$ maximum homogeneity. Of course, you can translate by $\eta$ and scale by $1-\eta$ to get back to a normalized scale of $0$ to $1$, but the unscaled version allows one to measure the absolute heterogeneity of samples [in some sense]. Larger numbers of parts, buckets, and/or colors allow greater heterogeneity, such that $\eta \to 0$ as the number of choices/parts increases.
edited Jan 13 at 3:16 · answered Jan 13 at 2:53 by Bey