Stochastic Neighbor Embedding (SNE) - How to Understand the Cost Function of the Kullback Leibler (KL)...












In the paper 'Stochastic Neighbor Embedding', the cost function in terms of the Kullback-Leibler (KL) divergence is



$$
C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_i \operatorname{KL}(P_i \| Q_i).
$$
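For concreteness, here is a small numerical sketch of this cost (my own illustration, not code from the paper), using Gaussian neighbor probabilities with a fixed bandwidth rather than the per-point $\sigma_i$ the paper tunes via perplexity:

```python
import numpy as np

def sne_cost(X, Y, sigma=1.0):
    """Illustrative SNE cost: sum_i KL(P_i || Q_i), where P comes from the
    high-dimensional points X and Q from the low-dimensional embedding Y.
    A single fixed bandwidth is used for simplicity."""
    def neighbor_probs(Z, s):
        # Squared pairwise distances, then Gaussian affinities
        D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        W = np.exp(-D / (2 * s ** 2))
        np.fill_diagonal(W, 0.0)              # p_{ii} = 0 by convention
        return W / W.sum(axis=1, keepdims=True)

    P = neighbor_probs(X, sigma)              # high-dimensional space
    Q = neighbor_probs(Y, 1.0)                # low-dimensional embedding
    mask = P > 0                              # skip the zeroed diagonal
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

If the "embedding" is just the data itself (same bandwidth), $P = Q$ and the cost is zero; any mismatch makes it strictly positive, since each row term is a KL divergence.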



Question 1:
Why does the paper say that SNE is an improvement over methods like LLE or SOM, in which widely separated data-points can be 'collapsed' as near neighbors?



To minimize the cost, a large $q_{ij}$ always corresponds to a small distance in the low-dimensional space, which can also cause a 'collapse'. Or does 'collapse' mean something else?



Question 2:
Similarly, I do not understand how SNE keeps the images of widely separated objects relatively far apart.










  • You should not use acronyms without writing them out at the first usage.
    – MrYouMath
    Aug 26 '18 at 7:11










  • Thanks, I've changed it.
    – Doris
    Aug 26 '18 at 9:59
















probability optimization machine-learning divergence






edited Aug 26 '18 at 13:47









Royi

asked Aug 26 '18 at 3:08









Doris

1 Answer













Question 1



The paper discusses this in the following quote:




Notice that making $q_{ij}$ large when $p_{ij}$ is small wastes some of the probability mass in the $q$
distribution so there is a cost for modeling a big distance in the high-dimensional space with
a small distance in the low-dimensional space, though it is less than the cost of modeling
a small distance with a big one. In this respect, SNE is an improvement over methods
like LLE [4] or SOM [5] in which widely separated data-points can be “collapsed” as near
neighbors in the low-dimensional space. The intuition is that while SNE emphasizes local
distances, its cost function cleanly enforces both keeping the images of nearby objects
nearby and keeping the images of widely separated objects relatively far apart.




Ideally, to minimize the KL divergence, we should have $q_{ij} \approx p_{ij}$. Their point is that there is an asymmetry with respect to increasing one compared to the other. Note that $p$ is inversely related to distance in the high-dimensional (HD) space, while $q$ is inversely related to distance in the low-dimensional (LD) embedding space. A large $q_{ij}$ with a small $p_{ij}$ means modelling a large HD distance with a small LD distance. This penalty is borne out in the other terms of the sum, where $q_{k\ell}$ will be too small because some of the density of $q$ was "wasted" on the pair $i,j$. On the other hand, modelling a small HD distance with a large LD distance means having a large $p_{ij}$ with a small $q_{ij}$, which makes the $i,j$ term itself rather punishing. I think their point is that LLE and SOM lack this asymmetry, which makes such collapsing more likely in those methods.
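To make the asymmetry concrete, here is a toy per-term comparison (my own numbers, not from the paper). The direct cost of giving a small $q$ to a large $p$ far exceeds the direct cost of giving a large $q$ to a small $p$; the latter's real damage arrives only indirectly, through the normalization of $q$ shrinking the other terms' $q_{k\ell}$:

```python
import math

def kl_term(p, q):
    # One term of the cost sum: p * log(p / q)
    return p * math.log(p / q)

# Large p_ij modeled with a small q_ij (nearby points mapped far apart):
heavy = kl_term(0.5, 0.01)   # ~1.96, a punishing term on its own
# Small p_ij modeled with a large q_ij (far points mapped nearby):
mild = kl_term(0.01, 0.5)    # ~-0.04, almost free in this term alone
assert heavy > mild
```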



Question 2



Basically reiterating: to keep the images of widely separated objects relatively far apart, we want $q_{ij}$ to be small when $p_{ij}$ is small.



If the HD distance is large, then $p_{ij}$ will be small, which makes it seem as if that term does not matter much, since it is $p_{ij}\log(p_{ij}/q_{ij})$. However, if the corresponding LD distance is small (meaning we did not keep the points far apart), then $q_{ij}$ will be large. Due to the normalization of $q$, this means we have "wasted" some of its probability mass here, so other terms will be higher as a result. Hence the method is driven to avoid this.
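A small numerical illustration of this "wasted mass" effect (hypothetical probabilities, for intuition only): spending $q$ mass on a far-apart pair necessarily starves the true neighbors, and the total divergence for that point goes up:

```python
import numpy as np

def kl(p, q):
    # KL(P_i || Q_i) = sum_j p_j * log(p_j / q_j)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical P_i over four neighbors; the last one is far away in high-D.
p = np.array([0.40, 0.30, 0.29, 0.01])

q_far_kept_far = np.array([0.40, 0.30, 0.29, 0.01])     # matches p: zero cost
q_far_mapped_near = np.array([0.25, 0.20, 0.15, 0.40])  # mass wasted on far pair

assert kl(p, q_far_kept_far) < kl(p, q_far_mapped_near)
```

Because each row of $q$ sums to one, the large $q_{ij}$ spent on the far pair comes straight out of the $q$'s of the genuine neighbors, inflating their $p\log(p/q)$ terms.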






        answered Jan 13 at 22:15









user3658307
