Why do we refer to the denominator of Bayes' theorem as “marginal probability”?
Consider the following characterization of Bayes' theorem:

Bayes' Theorem

Given some observed data $x$, the posterior probability that the parameter $\Theta$ has the value $\theta$ is $p(\theta \mid x) = p(x \mid \theta)\, p(\theta) / p(x)$, where $p(x \mid \theta)$ is the likelihood, $p(\theta)$ is the prior probability of the value $\theta$, and $p(x)$ is the marginal probability of the value $x$.
Is there any special reason why we call $p(x)$ the "marginal probability"? What is "marginal" about it?
probability probability-theory bayes-theorem
asked Jun 26 '15 at 2:34 by PP121; edited Jan 18 at 23:52 by nbro
"Marginal" does not mean "barely making the grade" but that the probability has been derived from a joint probability. The numerator in this instance is $p(x \mid \theta)\, p(\theta) = p(x, \theta)$, which is the joint probability, and $p(x)$ (as well as $p(\theta)$) are marginal probabilities, as they are derived from $p(x, \theta)$.
– Dilip Sarwate, Jun 26 '15 at 2:40
"Marginal probability" means the same thing here that it means in other contexts, i.e. "unconditional". (The meaning is context-dependent, since all probabilities are conditional.)
– Michael Hardy, Sep 27 '16 at 7:20
3 Answers
If you consider a joint distribution to be a table of values in columns and rows with their probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the totals in the margins of the table.
$$\begin{array}{c c} & X \\ \Theta & \boxed{\begin{array}{c|cc|c} ~ & 0 & 1 & X\mid \Theta \\ \hline 0 & 0.15 & 0.35 & 0.5 \\ 1 & 0.20 & 0.30 & 0.5 \\ \hline \Theta\mid X & 0.35 & 0.65 & ~ \end{array}}\end{array}$$
answered Jun 26 '15 at 2:39 by Graham Kemp; edited Jun 26 '15 at 2:49
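As a concrete sketch of the table idea (a minimal Python illustration using the same made-up numbers as the table above), the marginals are literally the row and column sums written "in the margins":

```python
# Joint distribution p(theta, x) from the table above:
# rows are theta in {0, 1}, columns are x in {0, 1}.
joint = [
    [0.15, 0.35],  # theta = 0
    [0.20, 0.30],  # theta = 1
]

# Marginal of theta: sum each row over x (the right-hand margin).
p_theta = [sum(row) for row in joint]

# Marginal of x: sum each column over theta (the bottom margin).
p_x = [sum(row[j] for row in joint) for j in range(2)]

print(p_theta)  # ≈ [0.5, 0.5]
print(p_x)      # ≈ [0.35, 0.65]
```

Note the comparisons are approximate only because of floating-point rounding; each margin still sums to 1.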
Yes, though I am not sure I understand your $\Theta \mid X$ or $X \mid \Theta$. I would have thought $0.65$ was $\mathbb{P}(X=1)$.
– Henry, Sep 27 '16 at 7:23
I hope you are aware of the fact that your answer is confusing. $\Theta \mid X$ is often used to represent "$\Theta$ given $X$" (a conditional), not a marginal; a marginal is just $X$ or $\Theta$. Furthermore, the OP asked why $p(x)$, the denominator, is called a "marginal" probability. I think the doubt lies in the fact that $p(x)$ is called marginal, whereas $p(\theta)$ is called the prior, but both can be calculated as marginals.
– nbro, Jan 18 at 23:49
To me, Bayes' theorem is all about inverting likelihood functions, and in that context calling it the marginal probability makes sense.
- Let's say I have an observation $c$,
- and a collection of states $\mathbf{s}=\{s_1,\ldots,s_n\}$ that could be causing that observation.
- Each of those states also defines a likelihood: $P(c\mid s_i)$,
- and we have a prior $P(s_i)$ (I'm assuming you have already motivated the prior; if not, ask another question on this site).
- So I want to know the state, based on the observation.
- If I just wanted to know the most likely state, and how the states compare to each other, I could define a scoring function, combining the likelihood of our observation given we are in the state with the chance of being in the state: $$\operatorname{score}_c(s_i)= P(c\mid s_i)P(s_i)$$
- Then to find the most likely state $s^\star$, I would just find the argmax: $$s^\star = \operatorname{argmax}_{s_i \in \mathbf{s}} \operatorname{score}_c(s_i) = \operatorname{argmax}_{s_i \in \mathbf{s}} P(c\mid s_i)P(s_i)$$
- That score function is quite nice. We can think of a score vector, which holds all the scores, and we can see which state is the most likely and which is the least. But it does not sum to one. We'd like it to sum to one, so we normalise it and call it a probability (even if it isn't one a priori, it will turn out that it is). The normalised score obviously depends on $c$, so it will be $P(s_i\mid c)$. It is given by
$$P(s_i\mid c)=\dfrac{\operatorname{score}_c(s_i)}{\sum_{s_j\in \mathbf{s}} \operatorname{score}_c(s_j) } = \dfrac{P(c\mid s_i)P(s_i)}{\sum_{s_j\in \mathbf{s}} P(c\mid s_j)P(s_j) }$$
- The above is a very useful form of Bayes' theorem.
- Let's take a closer look at the bottom line:
$$\sum_{s_j\in \mathbf{s}} P(c\mid s_j)P(s_j) = \sum_{s_j\in \mathbf{s}} P(c,s_j)$$
- So we are summing the joint probability over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field:
$$P(c) = \sum_{s_j\in \mathbf{s}} P(c,s_j)$$
Our bottom line, the normalising factor that makes everything sum to one, is just the marginal probability of $c$. Substituting that back in:
$$P(s_i\mid c) = \dfrac{P(c\mid s_i)P(s_i)}{P(c)}$$
So the bottom line $P(c)$ was just a marginal probability, which we find by summing over all possible values of the other field ($s_j$) in the top line.
answered Sep 27 '16 at 6:41 by Lyndon White; edited Sep 27 '16 at 7:19 by Michael Hardy
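The scoring-and-normalising recipe above can be sketched in a few lines of Python (the state names, prior, and likelihood values here are made up purely for illustration):

```python
# Hypothetical setup: three states that could have caused observation c.
prior = {"s1": 0.5, "s2": 0.3, "s3": 0.2}          # P(s_i)
likelihood = {"s1": 0.10, "s2": 0.40, "s3": 0.25}  # P(c | s_i)

# Unnormalised score: P(c | s_i) P(s_i), i.e. the joint P(c, s_i).
score = {s: likelihood[s] * prior[s] for s in prior}

# The normalising constant is the marginal P(c) = sum_j P(c, s_j).
p_c = sum(score.values())

# Posterior: the normalised scores, P(s_i | c). These sum to one.
posterior = {s: score[s] / p_c for s in score}

# The most likely state is the argmax of the score (or of the posterior,
# since dividing by the constant p_c does not change the argmax).
s_star = max(score, key=score.get)

print(p_c)     # ≈ 0.22
print(s_star)  # "s2"
```

Normalising by $P(c)$ leaves the ranking of the states untouched, which is why the argmax can be taken before or after normalisation.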
Superb explanation. If you (or anyone else) could provide a motivation for the prior, I'd be grateful.
– blz, Nov 2 '18 at 14:52
Please ask a separate question and link back to this Q&A.
– Lyndon White, Nov 3 '18 at 15:23
The explanation I was given when I was taught conditional probabilities is that if you draw up a table of the probabilities $p(x,y)$, then the row/column sums
$$ p(x) = \sum_{y} p(x,y) $$
(by the law of total probability) are written in the margins of the table.
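For instance (a minimal sketch with a made-up joint pmf), $p(x)$ falls out of the joint table by summing over $y$:

```python
# Hypothetical joint pmf stored as a dict keyed by (x, y).
p_xy = {
    ("a", 0): 0.1, ("a", 1): 0.3,
    ("b", 0): 0.4, ("b", 1): 0.2,
}

# Marginal p(x): sum the joint over all y (law of total probability).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(p_x)  # ≈ {"a": 0.4, "b": 0.6}
```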
By $p(x,y)$ do you just mean $p(x \wedge y)$ (i.e., the probability of $x$ and $y$ co-occurring)?
– PP121, Jun 26 '15 at 2:42
@PP121 Yes. It's an abbreviation for the joint probability. More specifically, $p_{X,Y}(x,y) = \mathsf P(X=x \cap Y=y)$.
– Graham Kemp, Jun 26 '15 at 2:44
So then what is analogous to $X$ and $Y$ in the original example I wrote out? Is it $X = \{\ldots x \ldots\}$ and $\Theta = \{\ldots \theta \ldots \}$?
– PP121, Jun 26 '15 at 2:49
@PP121 Yes, that would be so: $p(x\mid \theta) = P(X=x\mid \Theta=\theta)$.
– Graham Kemp, Jun 26 '15 at 2:51
@PP121 No; it is literal, at least for discrete random variables. $X$ being a discrete random variable means that, on inspection, it will be found to have one of the values within the sample space with a certain probability. For continuous random variables the appropriate measure is a probability density and things are somewhat more involved, but mostly the same principles apply.
– Graham Kemp, Jun 26 '15 at 3:56
$begingroup$
If you consider a joint distribution to be a table of values in columns and rows with there probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the total in the margins of the table.
$$begin{array}{c c} & X \ Theta & boxed{begin{array}{c|cc|c} ~ & 0 & 1 & Xmid Theta \ hline 0 & 0.15 & 0.35 & 0.5 \ 1 & 0.20 & 0.30 & 0.5 \hline Thetamid X & 0.35 & 0.65 & ~end{array}}end{array}$$
$endgroup$
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
add a comment |
$begingroup$
If you consider a joint distribution to be a table of values in columns and rows with there probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the total in the margins of the table.
$$begin{array}{c c} & X \ Theta & boxed{begin{array}{c|cc|c} ~ & 0 & 1 & Xmid Theta \ hline 0 & 0.15 & 0.35 & 0.5 \ 1 & 0.20 & 0.30 & 0.5 \hline Thetamid X & 0.35 & 0.65 & ~end{array}}end{array}$$
$endgroup$
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
add a comment |
$begingroup$
If you consider a joint distribution to be a table of values in columns and rows with there probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the total in the margins of the table.
$$begin{array}{c c} & X \ Theta & boxed{begin{array}{c|cc|c} ~ & 0 & 1 & Xmid Theta \ hline 0 & 0.15 & 0.35 & 0.5 \ 1 & 0.20 & 0.30 & 0.5 \hline Thetamid X & 0.35 & 0.65 & ~end{array}}end{array}$$
$endgroup$
If you consider a joint distribution to be a table of values in columns and rows with there probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the total in the margins of the table.
$$begin{array}{c c} & X \ Theta & boxed{begin{array}{c|cc|c} ~ & 0 & 1 & Xmid Theta \ hline 0 & 0.15 & 0.35 & 0.5 \ 1 & 0.20 & 0.30 & 0.5 \hline Thetamid X & 0.35 & 0.65 & ~end{array}}end{array}$$
edited Jun 26 '15 at 2:49
answered Jun 26 '15 at 2:39


Graham KempGraham Kemp
86.2k43478
86.2k43478
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
add a comment |
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
Yes, though I am not sure I understand your $Theta mid X$ or $X mid Theta$. I would have thought $0.65$ was $mathbb{P}(X=1)$
$endgroup$
– Henry
Sep 27 '16 at 7:23
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
$begingroup$
I hope you are aware of the fact that your answer is confusing. $Theta mid X$ is often used to represent "$Theta$ given $X$" (a conditional) and not to represent a marginal. A marginal is just X or $Theta$. Furthermore, the OP asked why p(x), the denominator is called a "marginal" probability. I think that the doubt lies in the fact that p(x) is called marginal, whereas $p(theta)$ is called prior, but both can be calculated as marginals.
$endgroup$
– nbro
Jan 18 at 23:49
add a comment |
$begingroup$
To me, bayes theorem is all about inverting likelihood functions, and in that context calling it marginal probabity makes sense.
- Lets say I have a observation $c$,
- and a collection of states $mathbf{s}={s_1,ldots,s_n}$, that could be causing that observation.
- And each of those states also defines a likelihood: $P(cmid s_i)$
- as well we have a prior $P(s_i)$ (I'm assuming you have already motivated the prior, if not ask another question on this site)
- So I want to know the state, based on the variable
- If I just wanted to know the most likely state, and how they compair to each other, I could define a scoring function -- combining the likelihood of our observation given we are in the state, with the change of being in the state: $$operatorname{score}_c(s_i)= P(cmid s_i)P(s_i)$$
- Then to find the most likely state $s^star$, i would just find the argmax $$s^star = operatorname{argmax}_{forall s_i in mathbf{s}} operatorname{score}_c(s_i) = operatorname{argmax}_{forall s_i in mathbf{s}} P(cmid s_i)P(s_i) $$
- That score function is quiet nice. We can think of a score vector, which has all the scores and we can see which is the most likely, and which is the least. But it does not sum to one. We'ld like to make it sum to one -- we would normalise it and call it a probability (even if it isn't -- but it will turn out it is). Our normalised score obviously depends on $c$ so it will be $P(s_imid c)$. The normalised score is given by
$$P(s_imid c)=dfrac{operatorname{score}_c(s_i)}{sum_{forall s_jin mathbf{s}} operatorname{score}_c(s_j) } = dfrac{P(cmid s_i)P(s_i)}{sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) }$$
- the above is a very useful form of Bayes Theorem.
- let's take a closer look at the bottom line:
$$sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
- So we are summing the Joint probability, over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field.
$$P(c) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
Our bottom like -- the normalising factor to make it sum to one -- that is just the marginal probability of $c$. Substituting that back in:
$$P(s_imid c) = dfrac{P(cmid s_i)P(s_i)}{P(c)}$$
So the bottom line $P(c)$ was just a marginally probability, that we find by summing over all possible values for the other field ($s_i$) in the top line.
$endgroup$
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
add a comment |
$begingroup$
To me, bayes theorem is all about inverting likelihood functions, and in that context calling it marginal probabity makes sense.
- Lets say I have a observation $c$,
- and a collection of states $mathbf{s}={s_1,ldots,s_n}$, that could be causing that observation.
- And each of those states also defines a likelihood: $P(cmid s_i)$
- as well we have a prior $P(s_i)$ (I'm assuming you have already motivated the prior, if not ask another question on this site)
- So I want to know the state, based on the variable
- If I just wanted to know the most likely state, and how they compair to each other, I could define a scoring function -- combining the likelihood of our observation given we are in the state, with the change of being in the state: $$operatorname{score}_c(s_i)= P(cmid s_i)P(s_i)$$
- Then to find the most likely state $s^star$, i would just find the argmax $$s^star = operatorname{argmax}_{forall s_i in mathbf{s}} operatorname{score}_c(s_i) = operatorname{argmax}_{forall s_i in mathbf{s}} P(cmid s_i)P(s_i) $$
- That score function is quiet nice. We can think of a score vector, which has all the scores and we can see which is the most likely, and which is the least. But it does not sum to one. We'ld like to make it sum to one -- we would normalise it and call it a probability (even if it isn't -- but it will turn out it is). Our normalised score obviously depends on $c$ so it will be $P(s_imid c)$. The normalised score is given by
$$P(s_imid c)=dfrac{operatorname{score}_c(s_i)}{sum_{forall s_jin mathbf{s}} operatorname{score}_c(s_j) } = dfrac{P(cmid s_i)P(s_i)}{sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) }$$
- the above is a very useful form of Bayes Theorem.
- let's take a closer look at the bottom line:
$$sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
- So we are summing the Joint probability, over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field.
$$P(c) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
Our bottom like -- the normalising factor to make it sum to one -- that is just the marginal probability of $c$. Substituting that back in:
$$P(s_imid c) = dfrac{P(cmid s_i)P(s_i)}{P(c)}$$
So the bottom line $P(c)$ was just a marginally probability, that we find by summing over all possible values for the other field ($s_i$) in the top line.
$endgroup$
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
add a comment |
$begingroup$
To me, bayes theorem is all about inverting likelihood functions, and in that context calling it marginal probabity makes sense.
- Lets say I have a observation $c$,
- and a collection of states $mathbf{s}={s_1,ldots,s_n}$, that could be causing that observation.
- And each of those states also defines a likelihood: $P(cmid s_i)$
- as well we have a prior $P(s_i)$ (I'm assuming you have already motivated the prior, if not ask another question on this site)
- So I want to know the state, based on the variable
- If I just wanted to know the most likely state, and how they compair to each other, I could define a scoring function -- combining the likelihood of our observation given we are in the state, with the change of being in the state: $$operatorname{score}_c(s_i)= P(cmid s_i)P(s_i)$$
- Then to find the most likely state $s^star$, i would just find the argmax $$s^star = operatorname{argmax}_{forall s_i in mathbf{s}} operatorname{score}_c(s_i) = operatorname{argmax}_{forall s_i in mathbf{s}} P(cmid s_i)P(s_i) $$
- That score function is quiet nice. We can think of a score vector, which has all the scores and we can see which is the most likely, and which is the least. But it does not sum to one. We'ld like to make it sum to one -- we would normalise it and call it a probability (even if it isn't -- but it will turn out it is). Our normalised score obviously depends on $c$ so it will be $P(s_imid c)$. The normalised score is given by
$$P(s_imid c)=dfrac{operatorname{score}_c(s_i)}{sum_{forall s_jin mathbf{s}} operatorname{score}_c(s_j) } = dfrac{P(cmid s_i)P(s_i)}{sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) }$$
- the above is a very useful form of Bayes Theorem.
- let's take a closer look at the bottom line:
$$sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
- So we are summing the Joint probability, over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field.
$$P(c) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
Our bottom like -- the normalising factor to make it sum to one -- that is just the marginal probability of $c$. Substituting that back in:
$$P(s_imid c) = dfrac{P(cmid s_i)P(s_i)}{P(c)}$$
So the bottom line $P(c)$ was just a marginally probability, that we find by summing over all possible values for the other field ($s_i$) in the top line.
$endgroup$
To me, bayes theorem is all about inverting likelihood functions, and in that context calling it marginal probabity makes sense.
- Lets say I have a observation $c$,
- and a collection of states $mathbf{s}={s_1,ldots,s_n}$, that could be causing that observation.
- And each of those states also defines a likelihood: $P(cmid s_i)$
- as well we have a prior $P(s_i)$ (I'm assuming you have already motivated the prior, if not ask another question on this site)
- So I want to know the state, based on the variable
- If I just wanted to know the most likely state, and how they compair to each other, I could define a scoring function -- combining the likelihood of our observation given we are in the state, with the change of being in the state: $$operatorname{score}_c(s_i)= P(cmid s_i)P(s_i)$$
- Then to find the most likely state $s^star$, i would just find the argmax $$s^star = operatorname{argmax}_{forall s_i in mathbf{s}} operatorname{score}_c(s_i) = operatorname{argmax}_{forall s_i in mathbf{s}} P(cmid s_i)P(s_i) $$
- That score function is quiet nice. We can think of a score vector, which has all the scores and we can see which is the most likely, and which is the least. But it does not sum to one. We'ld like to make it sum to one -- we would normalise it and call it a probability (even if it isn't -- but it will turn out it is). Our normalised score obviously depends on $c$ so it will be $P(s_imid c)$. The normalised score is given by
$$P(s_imid c)=dfrac{operatorname{score}_c(s_i)}{sum_{forall s_jin mathbf{s}} operatorname{score}_c(s_j) } = dfrac{P(cmid s_i)P(s_i)}{sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) }$$
- the above is a very useful form of Bayes Theorem.
- let's take a closer look at the bottom line:
$$sum_{forall s_jin mathbf{s}} P(cmid s_j)P(s_j) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
- So we are summing the Joint probability, over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field.
$$P(c) = sum_{forall s_jin mathbf{s}} P(c,s_j)$$
Our bottom like -- the normalising factor to make it sum to one -- that is just the marginal probability of $c$. Substituting that back in:
$$P(s_imid c) = dfrac{P(cmid s_i)P(s_i)}{P(c)}$$
So the bottom line $P(c)$ was just a marginally probability, that we find by summing over all possible values for the other field ($s_i$) in the top line.
edited Sep 27 '16 at 7:19
Michael Hardy
1
1
answered Sep 27 '16 at 6:41


Lyndon WhiteLyndon White
6551620
6551620
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
add a comment |
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Superb explanation. If you (or anyone else) could provide a motivation for the prior, i'd be grateful.
$endgroup$
– blz
Nov 2 '18 at 14:52
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
$begingroup$
Please ask a separate question and link back to this QA
$endgroup$
– Lyndon White
Nov 3 '18 at 15:23
add a comment |
$begingroup$
The explanation I was given when I was taught conditional probabilities is that if you draw up a table of the probabilities $p(x,y)$, then the row/column sums
$$ p(x) = sum_{y} p(x,y) $$
(by the law of total probability) are written in the margins of the table.
$endgroup$
$begingroup$
By $p(x,y)$ do you just mean $p(x wedge y)$ (i.e., the probability of $x$ and $y$ co-occurring)?
$endgroup$
– PP121
Jun 26 '15 at 2:42
1
$begingroup$
@PP121 Yes. It's an abbreviation for the joint probability. More specifically $p_{X,Y}(x,y) = mathsf P(X=x cap Y=y)$.
$endgroup$
– Graham Kemp
Jun 26 '15 at 2:44
$begingroup$
So then what is analogous to $X$ and $Y$ in the original example I wrote out? Is it $X = {ldots x ldots}$ and $Theta = {ldots theta ldots }$?
$endgroup$
– PP121
Jun 26 '15 at 2:49
1
$begingroup$
Yes. @PP121 That would be so. $p(xmid theta) = P(X=xmid Theta=theta)$
$endgroup$
– Graham Kemp
Jun 26 '15 at 2:51
1
$begingroup$
@PP121 No; it is literal, at least for discrete random variables. $X$ being a discrete random variable means that, on inspection, it will be found to have one of the values within the sample space with a certain probability. For continuous random variables the appropriate measure is a probability density and things are somewhat more involved, but mostly the same principles apply.
$endgroup$
– Graham Kemp
Jun 26 '15 at 3:56
edited Jan 19 at 5:21
answered Jun 26 '15 at 2:38
– Chappers
Thanks for contributing an answer to Mathematics Stack Exchange!