What would be a good mathematical model to measure the degree of homogeneity of a mixture?
At my current workplace, we are looking to quantify a batch to say how "similar/dissimilar" the items are. The problem can be stated like so (transformed for public posting):
We have parts that can be assembled from different "buckets". Each bucket holds pieces of a single type (e.g., triangular or circular pieces, cubes, or cylinders) that come in different colors.
A particular part is assembled by picking pieces from each bucket. For simplicity, we may assume that pieces are picked from all buckets. It looks something like this:
Problem: Compute how similar (or dissimilar) the parts in a batch are. The image above shows a batch of 5 parts. The individual values are categorical variables, like red, green, and blue in this example.
Explanation: Similarity is something we can define with regard to color. If all parts have the same "row of data" ($a-p-m-e-i$ for Part 1 above), we say the homogeneity is 100% (heterogeneity 0%). If every part is made by picking a unique piece from each bucket, homogeneity is 0% (heterogeneity 100%). Everything else is somewhere in between, and that in-between value is the measure I'm trying to come up with for a particular batch.
Current Idea: Treat this as a vector problem. We have two vectors representing the 0%-homogeneity and 100%-homogeneity points. Given a batch, we compute another vector $V$ and see how close it is to the 0%-homogeneity vector and how far it is from the 100%-homogeneity vector (i.e., imagine a point placed on a line segment between two endpoints). We only need a metric for the homogeneity of a particular batch. Would this be a mathematically sound way of computing similarity? Are there alternative approaches? Existing references?
Extension: The above won't work if some parts are assembled from only a subset of the buckets in a batch. What modification would allow for this scenario?
Update: A simple "count" of each element in a column should indicate how many different colors are used. So $n\cdot a$ implies only $a$ is used, but $\frac{n}{3}a+\frac{n}{3}b+\frac{n}{3}c$ would be the "ideal" heterogeneity vector, since it is distributed evenly across three values.
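To make the counting idea concrete, here is a minimal Python sketch (the list-of-lists batch layout and the helper name column_color_counts are my own illustration, not an established API):
from collections import Counter

def column_color_counts(batch):
    """Count how often each color appears in each bucket ("column").
    batch: list of parts, each a list of colors, one per bucket."""
    return [Counter(column) for column in zip(*batch)]

# Example: a batch of 3 parts drawn from 2 buckets
batch = [["a", "p"], ["a", "q"], ["b", "p"]]
print(column_color_counts(batch))
# [Counter({'a': 2, 'b': 1}), Counter({'p': 2, 'q': 1})]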
combinatorics matrices vectors mathematical-modeling
edited Jan 9 at 23:42 by Omnomnomnom · asked Jan 9 at 22:08 by PhD
Is the sum of any row constant? Can we think of the entries as percentages (i.e., the percent of part $i$ built from bucket $j$)? If not, how should the rows $a-p-m-e-i$ and $2a-2p-2m-2e-2i$ compare in terms of your intuitive idea of homogeneity?
– Omnomnomnom, Jan 9 at 22:36
Is there a concrete goal of homogeneity, i.e., a condition/outcome that we're trying to optimize?
– Omnomnomnom, Jan 9 at 22:38
For the sum: unfortunately no, we can't rely on that. These "values" are mapped to a "count" and are not really numeric, so to speak; they're more like categorical variables. Goal: we completely lack this measure, yet we talk about it a lot and everyone seems to rely on intuition for what it means, so I was hoping to provide a more concrete metric.
– PhD, Jan 9 at 23:23
@Omnomnomnom - See update. We are not going to be optimizing for anything with the value, though. Everyone talks about it and they probably understand what they mean, but no one in the industry has a metric to ground this more rigorously. I was curious whether there is something I could leverage from MathLand :)
– PhD, Jan 9 at 23:29
Regarding your extension, one idea is to include "N/A" as a color, which is to say that the bucket is not used for the part.
– Omnomnomnom, Jan 9 at 23:43
1 Answer
One possible definition of homogeneity is to test how unlikely it is that the results would have been generated by sampling from a discrete uniform distribution over each bucket's colors, independently for each part:
Let $b \in B$ be a given bucket, and let $C_b$ be the set of colors available for components from that bucket (including "N/A", per a suggestion in the comments). If we define $X_{ib}$ as the selected color for part $i$ from bucket $b$, then
$$X_{ib} \sim \text{DiscreteUniform}(C_b) \implies P(X_{ib} = c) = \frac{1}{|C_b|} \quad \text{for each } c \in C_b$$
For any given part $i$, we have the vector $X_i := (X_{ib})_{b \in B}$ that records the color choices from each bucket.
Our null hypothesis $H_0$ is that the parts are constructed by selecting a color at random from each bucket for each part, according to that bucket's discrete uniform distribution.
If we have $N$ parts, then the distribution of colors selected from a given bucket $b$ across all parts (i.e., the "column" distribution) follows a multinomial distribution.
What we want to test is whether the observed distribution of colors across all buckets is consistent with the null hypothesis. We can formalize "expectation" by noting that the expected number of times a particular color $c$ is chosen from bucket $b$, call it $e_{bc}$, is $\frac{N}{|C_b|}$. This gives the expected number of times each bucket-color combination (e.g., red-cylinder) should occur among our $N$ parts. The observed number of times a given bucket-color combination occurs is $O_{bc}$, where
$$O_{bc} = \sum_{i=1}^{N} \mathbf{1}_{c}(X_{ib})$$
As in a chi-square goodness-of-fit test, we can quantify the discrepancy between observations and expectations using a deviation statistic $d_{bc}$, for example $d_{bc} = |e_{bc} - O_{bc}|$. The total deviation $d$ is then the sum of the deviations over all bucket-color combinations:
$$d = \sum_{b \in B} \sum_{c \in C_b} d_{bc} \quad \text{where} \quad d_{bc} = \left|\frac{N}{|C_b|} - O_{bc}\right|$$
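For concreteness, here is a minimal Python sketch of this total deviation (the data layout, buckets as a dict of color lists and parts as bucket-to-color dicts, is my own illustration):
def observed_d(batch, buckets):
    """Total deviation (d-hat) of an observed batch from uniform expectations.

    batch:   list of parts, each a dict mapping bucket -> chosen color
    buckets: dict mapping bucket -> list of available colors (C_b)
    """
    n = len(batch)
    d_hat = 0.0
    for b, colors in buckets.items():
        for c in colors:
            o_bc = sum(1 for part in batch if part.get(b) == c)  # O_bc
            d_hat += abs(n / len(colors) - o_bc)                 # |e_bc - O_bc|
    return d_hat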
The tricky part is determining the probability of different values of $d$ under the null hypothesis. I don't know whether there is a nice closed-form expression, but you can approximate it computationally (to a high degree of accuracy) using simulation. The following sketch, reusing the observed_d helper above, approximates the null distribution of $d$:
import random

def null_distribution_d(buckets, n_parts, n_runs):
    """Approximate the null distribution of the total deviation d."""
    d = []
    for _ in range(n_runs):
        # Build one simulated batch: each part draws a color uniformly
        # at random from every bucket (the null model).
        batch = [{b: random.choice(colors) for b, colors in buckets.items()}
                 for _ in range(n_parts)]
        d.append(observed_d(batch, buckets))  # discrepancy d_r for this run
    return d
With that distribution in hand, we can define the "homogeneity" of the actual sample as $1$ minus the p-value of the test that the color assignments were drawn uniformly (maximum heterogeneity). If we let $\hat{d}$ be the observed total discrepancy of our sample:
$$\text{Homogeneity} = 1 - P_{H_0}(d > \hat{d}) = P_{H_0}(d \leq \hat{d})$$
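Putting the pieces together, a Monte Carlo estimate of this probability, using the two sketches above, is simply the fraction of simulated deviations that do not exceed the observed one:
def homogeneity(batch, buckets, n_runs=10_000):
    """Estimate Homogeneity = P_H0(d <= d_hat) by simulation."""
    d_hat = observed_d(batch, buckets)
    d_null = null_distribution_d(buckets, len(batch), n_runs)
    return sum(d_r <= d_hat for d_r in d_null) / n_runs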
This has the property of being between $\eta$ and $1$, where $\eta = P_{H_0}(d = d_{\text{min}})$, with $\eta$ indicating maximum heterogeneity and $1$ maximum homogeneity. Of course, you can translate by $\eta$ and scale by $1-\eta$ to get back to a normalized scale of $0$ to $1$, but the unscaled version allows one to measure the absolute heterogeneity of samples [in some sense]. Larger numbers of parts, buckets, and/or colors allow greater heterogeneity, such that $\eta \to 0$ as the number of choices/parts increases.
edited Jan 13 at 3:16 · answered Jan 13 at 2:53 by Bey