What temperature of Softmax layer should I use during neural network training?
I've written a GRU (gated recurrent unit) implementation in C#, and it works fine. But my softmax layer has no temperature parameter (i.e., it is fixed at T = 1). I want to implement "softmax with temperature":
$$
P_{i} = \frac{e^{\frac{y_{i}}{T}}}{\sum_{k=1}^{n}e^{\frac{y_{k}}{T}}}
$$
but I cannot find an answer to my question anywhere: should I train my neural network with T = 1 (my current default), or should I use some specific value, perhaps related to the temperature I intend to use during sampling?
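For concreteness, here is a minimal sketch of the layer I have in mind, in C# (the method name and the max-subtraction trick are my own choices, not from any particular library):

    using System;
    using System.Linq;

    // Softmax with temperature: P_i = exp(y_i / T) / sum_k exp(y_k / T).
    // Subtracting the max logit first keeps Math.Exp from overflowing;
    // it cancels in the ratio, so the probabilities are unchanged.
    static double[] SoftmaxWithTemperature(double[] y, double T)
    {
        double max = y.Max();
        double[] exps = y.Select(v => Math.Exp((v - max) / T)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    double[] logits = { 2.0, 1.0, 0.1 };
    Console.WriteLine(string.Join(", ",
        SoftmaxWithTemperature(logits, 1.0).Select(p => p.ToString("F3"))));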
machine-learning neural-networks
asked Dec 17 '15 at 10:21
R. A.
Is the idea of temperature that it should increase as activity increases? It's kind of cold until it has "warmed up", maybe?
– mathreadler
Dec 17 '15 at 11:06
@mathreadler The idea behind temperature in softmax is to control the randomness of predictions: at high temperature the softmax outputs move closer to each other (the probabilities all become equal as T → ∞), while at low temperature softmax becomes more and more of a "hardmax" (the probability corresponding to the max input approaches 1.0 and the others approach 0.0 as T → 0). So I know how to use it when sampling values from my network. But the question is: should I train the network using values of T other than 1?
– R. A.
Dec 17 '15 at 11:18
My guess is that you should alter T as you train the network. It's cold before you get up to speed with the training and then gradually gets hotter as you train.
– mathreadler
Dec 17 '15 at 11:24
@mathreadler I'm afraid the network outputs would gradually become less and less meaningful, because a high temperature erases the differences between them. The network would probably eventually learn to output values with very large differences to deal with this, so I wouldn't get controllable randomness at the end of training. So I think the correct approach is to train with T=1 but sample with various T (depending on output requirements). I just read the source code of a char-level RNN, and it seems that I am right.
– R. A.
Dec 17 '15 at 12:31
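(To illustrate the effect discussed above, here is a small demo reusing the SoftmaxWithTemperature sketch from the question; the logits are made up:)

    double[] logits = { 2.0, 1.0, 0.1 };
    foreach (double T in new[] { 0.1, 1.0, 10.0 })
    {
        double[] p = SoftmaxWithTemperature(logits, T);
        Console.WriteLine($"T = {T}: " +
            string.Join(", ", p.Select(v => v.ToString("F3"))));
    }
    // T = 0.1 -> ~1.000, 0.000, 0.000   (near-"hardmax")
    // T = 1.0 -> ~0.659, 0.242, 0.099   (plain softmax)
    // T = 10  -> ~0.366, 0.331, 0.303   (near-uniform)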
2 Answers
Adding temperature to the softmax changes the probability distribution, i.e., the outputs become softer when $T > 1$. However, I suspect that SGD will simply learn to compensate for this rescaling effect.
edited Mar 21 '18 at 0:29
answered Feb 27 '18 at 2:19
Peixiang Zhong
I don't think this is true. It's more like grad_new(x) = (1/T) · grad_old(x/T), not simply grad_new(x) = (1/T) · grad_old(x).
– isarandi
Mar 19 '18 at 18:40
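(With the question's definition $P_i = e^{y_i/T}/\sum_k e^{y_k/T}$, the chain rule indeed gives grad_new(x) = (1/T) · grad_old(x/T). A quick numerical check of that claim, a sketch assuming cross-entropy loss with target class 0 and reusing SoftmaxWithTemperature from the question:)

    double[] x = { 1.0, -0.5, 0.3 };
    double T = 2.0, eps = 1e-6;
    int target = 0;

    // Cross-entropy loss of the temperature-T softmax at logits z.
    double LossT(double[] z) => -Math.Log(SoftmaxWithTemperature(z, T)[target]);

    // Analytic T=1 gradient is softmax(z) - onehot(target), evaluated at x/T.
    double[] pOld = SoftmaxWithTemperature(x.Select(v => v / T).ToArray(), 1.0);

    for (int i = 0; i < x.Length; i++)
    {
        double[] xp = (double[])x.Clone(); xp[i] += eps;
        double[] xm = (double[])x.Clone(); xm[i] -= eps;
        double gradNew = (LossT(xp) - LossT(xm)) / (2 * eps);   // numerical
        double gradOld = pOld[i] - (i == target ? 1.0 : 0.0);   // analytic
        Console.WriteLine($"{gradNew:F6} vs {gradOld / T:F6}"); // these match
    }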
Even with T=1 you have an implicit temperature, due to your choice of unit when measuring y, or perhaps a scaling parameter in generating y; you could normalize by dividing by the standard deviation. The choice of temperature during training may depend on the training set size. In any case you should validate your training against an independent data set, and you may use that to tune the temperature (and, philosophically, re-validate the temperature choice on yet another set of data).
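(As a sketch of what "dividing by the standard deviation" could look like in practice — my interpretation, not necessarily what was meant — again reusing SoftmaxWithTemperature from the question:)

    // Standardize the logits before the softmax, removing the "implicit
    // temperature" set by their scale. The epsilon guards against std = 0.
    static double[] NormalizedSoftmax(double[] y)
    {
        double mean = y.Average();
        double std = Math.Sqrt(y.Select(v => (v - mean) * (v - mean)).Average()) + 1e-8;
        return SoftmaxWithTemperature(y.Select(v => (v - mean) / std).ToArray(), 1.0);
    }

    // Scaled copies of the same logits now yield identical probabilities.
    double[] a = { 2.0, 1.0, 0.1 };
    double[] b = a.Select(v => v * 5.0).ToArray();
    Console.WriteLine(string.Join(", ", NormalizedSoftmax(a).Select(p => p.ToString("F3"))));
    Console.WriteLine(string.Join(", ", NormalizedSoftmax(b).Select(p => p.ToString("F3"))));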
answered May 26 '18 at 16:05
dioid