What temperature of Softmax layer should I use during neural network training?

I've written a GRU (gated recurrent unit) implementation in C#, and it works fine. But my Softmax layer has no temperature parameter (T = 1). I want to implement "softmax with temperature":
$$
P_{i} = \frac{e^{\frac{y_{i}}{T}}}{\sum_{k=1}^{n}e^{\frac{y_{k}}{T}}}
$$
but I cannot find any answer to my question: should I train my neural network using T = 1 (my default training), or should I use some specific value somehow related to the value I intend to use during sampling?

machine-learning neural-networks

asked Dec 17 '15 at 10:21 – R. A.
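
For concreteness, here is a minimal C# sketch of the formula above (the class and method names are illustrative, not from the question's actual code). It folds the temperature into an otherwise standard softmax and subtracts the maximum logit first, the usual trick to keep `Math.Exp` from overflowing:

```csharp
using System;

static class SoftmaxLayer
{
    // Softmax with temperature: P_i = exp(y_i / T) / sum_k exp(y_k / T).
    // Subtracting the max logit does not change the result but avoids
    // overflow in Math.Exp for large inputs.
    public static double[] SoftmaxWithTemperature(double[] y, double temperature)
    {
        double max = double.NegativeInfinity;
        foreach (double v in y)
            if (v > max) max = v;

        double[] p = new double[y.Length];
        double sum = 0.0;
        for (int i = 0; i < y.Length; i++)
        {
            p[i] = Math.Exp((y[i] - max) / temperature);
            sum += p[i];
        }
        for (int i = 0; i < y.Length; i++)
            p[i] /= sum;
        return p;
    }
}
```

With `temperature = 1` this reduces to the ordinary softmax, so the existing layer is just the special case T = 1.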

  • Is the idea of temperature that it should increase as activity increases? It's kind of cold until it has "warmed up", maybe?
    – mathreadler
    Dec 17 '15 at 11:06

  • @mathreadler The idea behind temperature in softmax is to control the randomness of predictions: at high temperature the softmax outputs get closer to each other (the probabilities all become equal at T = ∞), while at low temperature "softmax" becomes more and more of a "hardmax" (the probability corresponding to the max input is ~1.0 and the others ~0.0 at T = 0). So I know how to use it when sampling values from my network. But the question is: should I train the network using values of T other than 1?
    – R. A.
    Dec 17 '15 at 11:18

  • My guess is that you should alter T as you train the network. It's cold before you get up to speed with the training and then gradually gets hotter as you train.
    – mathreadler
    Dec 17 '15 at 11:24

  • @mathreadler Network outputs would gradually become less and less meaningful, I'm afraid, because a high temperature erases the differences between them. The network would probably eventually learn to output values with very large differences to deal with this, so I wouldn't get controllable randomness at the end of training. So I think the correct approach is to train with T = 1 but sample with various T (depending on the output requirements). I just read the source code of a char-level RNN, and it seems that I am right.
    – R. A.
    Dec 17 '15 at 12:31
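
To make the "train at T = 1, sample at various T" recipe from these comments concrete, here is a hypothetical C# sampling helper built on the `SoftmaxWithTemperature` sketch above (again, the name and signature are mine):

```csharp
// Train the network with T = 1; use this helper only at generation time.
// Small temperatures push the draw toward the arg-max ("hardmax"),
// large temperatures approach uniform sampling.
public static int SampleIndex(double[] logits, double temperature, Random rng)
{
    double[] p = SoftmaxLayer.SoftmaxWithTemperature(logits, temperature);

    // Inverse-CDF draw from the categorical distribution p.
    double u = rng.NextDouble();
    double cumulative = 0.0;
    for (int i = 0; i < p.Length; i++)
    {
        cumulative += p[i];
        if (u <= cumulative) return i;
    }
    return p.Length - 1; // guard against floating-point round-off
}
```
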
2 Answers

Adding temperature into the softmax will change the probability distribution, i.e., it becomes softer when $T>1$. However, I suspect that SGD will learn this rescaling effect.
– Peixiang Zhong
answered Feb 27 '18 at 2:19, edited Mar 21 '18 at 0:29

  • I don't think this is true. It's more like $\mathrm{grad}_{\text{new}}(x)=\frac{1}{T}\,\mathrm{grad}_{\text{old}}(xT)$, not simply $\mathrm{grad}_{\text{new}}(x)=\frac{1}{T}\,\mathrm{grad}_{\text{old}}(x)$.
    – isarandi
    Mar 19 '18 at 18:40
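
To make the comment's point explicit, here is the chain rule under the question's convention of dividing the logits $y$ by $T$ (writing $L_T$ for the training loss seen through a temperature-$T$ softmax):
$$
L_T(y) = L_1\!\left(\frac{y}{T}\right)
\quad\Longrightarrow\quad
\nabla_y L_T(y) = \frac{1}{T}\,(\nabla L_1)\!\left(\frac{y}{T}\right).
$$
So the gradient is not merely rescaled by $1/T$, which a learning rate could absorb; it is also evaluated at the rescaled logits $y/T$. For cross-entropy with a one-hot target $t$, for instance, $\nabla_y L_T(y) = \frac{1}{T}\left(\operatorname{softmax}(y/T) - t\right)$.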

Even with T = 1 you have an implicit temperature, due to your choice of units in measuring y, or perhaps even a scaling parameter in generating y; you could normalize by dividing by the standard deviation. The choice of temperature during training may depend on the training set size. In any case, you should validate your training against an independent data set, and you may use that set to tune the temperature choice (and, philosophically, re-validate the chosen temperature on yet another set of data).
– dioid
answered May 26 '18 at 16:05
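
As a sketch of the normalization this answer suggests (the helper name is mine, and standardizing rather than only dividing by the standard deviation is my reading; subtracting the mean is harmless because softmax is shift-invariant):

```csharp
// Remove the implicit temperature set by the scale/units of the logits
// by standardizing them before the softmax. Subtracting the mean does
// not change the softmax output; dividing by the standard deviation
// fixes the effective temperature regardless of the units of y.
public static double[] NormalizeLogits(double[] y)
{
    double mean = 0.0;
    foreach (double v in y) mean += v;
    mean /= y.Length;

    double variance = 0.0;
    foreach (double v in y) variance += (v - mean) * (v - mean);
    double std = Math.Sqrt(variance / y.Length);

    double[] z = new double[y.Length];
    for (int i = 0; i < y.Length; i++)
        z[i] = (y[i] - mean) / (std + 1e-12); // epsilon guards against std == 0
    return z;
}
```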