Stochastic Neighbor Embedding (SNE) - How to Understand the Cost Function of the Kullback Leibler (KL)...
In the paper 'Stochastic Neighbor Embedding', the cost function in terms of the Kullback–Leibler (KL) divergence is
$$
C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_i \operatorname{KL}(P_i \,\|\, Q_i).
$$
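For concreteness, here is how I read the formula as a computation (a minimal sketch of my own, not from the paper): $P$ and $Q$ are matrices whose $i$-th rows are the neighbor distributions $P_i$ and $Q_i$ (so each row sums to 1), and $C$ is the sum of the row-wise KL divergences.

```python
import numpy as np

def kl_cost(P, Q):
    """C = sum_i KL(P_i || Q_i) for row-stochastic matrices P and Q
    (zero diagonals, each row summing to 1)."""
    mask = P > 0                      # use the convention 0 * log(0/q) = 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Hand-made neighbor distributions for three points, just to run the formula.
P = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.5, 0.5, 0.0]])
Q = np.array([[0.0, 0.8, 0.2],
              [0.8, 0.0, 0.2],
              [0.5, 0.5, 0.0]])

print(kl_cost(P, Q))                  # small, since Q roughly matches P
```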
Question 1:
Why does the paper say that SNE is an improvement over methods like LLE or SOM, in which widely separated data points can be 'collapsed' into near neighbors?
When minimizing the cost, a large $q_{ij}$ always corresponds to a small distance in the low-dimensional space, which can also cause a 'collapse'. Or does 'collapse' mean something else here?
Question 2:
Similarly, I do not understand how SNE keeps the images of widely separated objects relatively far apart.
probability optimization machine-learning divergence
You should not use acronyms without writing them out at the first usage. – MrYouMath, Aug 26 '18 at 7:11
Thanks, I've changed it. – Doris, Aug 26 '18 at 9:59
1 Answer
Question 1
The paper discusses this in the following quote:
Notice that making $q_{ij}$ large when $p_{ij}$ is small wastes some of the probability mass in the $q$ distribution so there is a cost for modeling a big distance in the high-dimensional space with a small distance in the low-dimensional space, though it is less than the cost of modeling a small distance with a big one. In this respect, SNE is an improvement over methods like LLE [4] or SOM [5] in which widely separated data-points can be “collapsed” as near neighbors in the low-dimensional space. The intuition is that while SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.
Ideally, to minimize the KL divergence, we should have $q_{ij} \approx p_{ij}$. Their point is that there is an asymmetry between increasing one relative to the other. Note that $p$ is inversely related to distance in the high-dimensional (HD) space, while $q$ is inversely related to distance in the low-dimensional (LD) embedding space. A large $q_{ij}$ with a small $p_{ij}$ means modelling a large HD distance with a small LD distance. This penalty is borne out in other terms of the sum, where some $q_{k\ell}$ will be too small because part of $q$'s probability mass was "wasted" on the pair $(i,j)$. On the other hand, modelling a small HD distance with a large LD distance means having a large $p_{ij}$ with a small $q_{ij}$, so the $(i,j)$ term itself will be rather punishing. I think their point is that, without this asymmetry, such collapsing is more likely, as it apparently is in LLE and SOM.
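As a rough numerical illustration of this asymmetry (toy numbers of my own, not taken from the paper): take a single point $i$ with two near HD neighbors and one far one, and compare the cost of pulling the far neighbor in against the cost of pushing a near neighbor out by the same amount.

```python
import numpy as np

def neighbor_probs(dists):
    """Conditional neighbor probabilities for one point, p_{j|i} proportional
    to exp(-d_ij^2), as in SNE (unit bandwidth; the paper tunes a per-point sigma_i)."""
    w = np.exp(-np.asarray(dists, dtype=float) ** 2)
    return w / w.sum()

def kl(p, q):
    """KL divergence between two neighbor distributions of the same point."""
    return float(np.sum(p * np.log(p / q)))

# Point i has two near HD neighbors (distance 1) and one far one (distance 2).
p = neighbor_probs([1.0, 1.0, 2.0])

# Big HD distance modelled with a small LD distance: the far neighbor is pulled in to 1.
q_far_pulled_in = neighbor_probs([1.0, 1.0, 1.0])

# Small HD distance modelled with a big LD distance: a near neighbor is pushed out to 2.
q_near_pushed_out = neighbor_probs([2.0, 1.0, 2.0])

print(kl(p, q_far_pulled_in))    # ~0.31 -- the cheaper mistake
print(kl(p, q_near_pushed_out))  # ~0.84 -- the much more expensive mistake
```

Both layouts misplace exactly one distance by the same amount, yet the second comes out far more expensive; that is the asymmetry the quote refers to.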
Question 2
Basically reiterating: to keep the images of widely separated objects relatively far apart, we want $q_{ij}$ to be small when $p_{ij}$ is small.
If the HD distance is large, then $p_{ij}$ will be small (which makes it seem as if that term does not matter much, since it is $p_{ij}\log(p_{ij}/q_{ij})$). However, if the corresponding LD distance is small (meaning we did not keep the points far apart), then $q_{ij}$ will be large. Because $q$ is normalized, this means we have "wasted" some of its probability mass here, and other terms become larger as a result. Hence the method is driven to avoid this.
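To make the "wasted mass" point concrete, here is a small experiment of my own (not from the paper, and using a fixed unit bandwidth instead of the per-point $\sigma_i$): two well-separated clusters in high dimensions are laid out in 1-D once faithfully and once collapsed on top of each other. Collapsing inflates the cross-cluster $q_{ij}$'s, and because each row of $Q$ is normalized, that mass is taken from the true neighbors, so the total cost goes up.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_probs(X):
    """Row-conditional probabilities p_{j|i} proportional to exp(-||x_i - x_j||^2)
    (unit bandwidth everywhere, for simplicity)."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def sne_cost(X_high, Y_low):
    """Sum of row-wise KL divergences between HD and LD neighbor distributions."""
    P, Q = cond_probs(X_high), cond_probs(Y_low)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Two well-separated clusters of 10 points each in 10-D.
A = rng.normal(0.0, 0.3, size=(10, 10))
B = rng.normal(0.0, 0.3, size=(10, 10)) + 5.0
X = np.vstack([A, B])

# Faithful 1-D layout: the two clusters stay apart.
Y_faithful = np.concatenate([rng.normal(0.0, 0.3, 10),
                             rng.normal(5.0, 0.3, 10)])[:, None]

# Collapsed 1-D layout: both clusters land in the same place.
Y_collapsed = np.concatenate([rng.normal(0.0, 0.3, 10),
                              rng.normal(0.0, 0.3, 10)])[:, None]

print(sne_cost(X, Y_faithful))   # lower cost
print(sne_cost(X, Y_collapsed))  # clearly higher: q mass spent on collapsed,
                                 # truly distant pairs is taken from true neighbors
```

On a typical run the collapsed layout comes out noticeably more expensive, which is the mechanism that keeps the images of widely separated objects apart.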
– user3658307, answered Jan 13 at 22:15