What temperature of Softmax layer should I use during neural network training?

I've written a GRU (gated recurrent unit) implementation in C#, and it works fine. But my Softmax layer has no temperature parameter (T = 1). I want to implement "softmax with temperature":
$$
P_{i} = \frac{e^{\frac{y_{i}}{T}}}{\sum_{k=1}^{n}e^{\frac{y_{k}}{T}}}
$$
but I cannot find any answer to my question: should I train my neural network using T = 1 (my default training), or should I use some specific value somehow related to the value I intend to use during sampling?

machine-learning neural-networks

asked Dec 17 '15 at 10:21 – R. A.
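
For concreteness, here is a minimal C# sketch of the formula above (the class and method names are illustrative, not from the question's actual code). It folds the temperature into an otherwise standard softmax and subtracts the maximum logit first, the usual trick to keep `Math.Exp` from overflowing:

```csharp
using System;

static class SoftmaxLayer
{
    // Softmax with temperature: P_i = exp(y_i / T) / sum_k exp(y_k / T).
    // Subtracting the max logit does not change the result but avoids
    // overflow in Math.Exp for large inputs.
    public static double[] SoftmaxWithTemperature(double[] y, double temperature)
    {
        double max = double.NegativeInfinity;
        foreach (double v in y)
            if (v > max) max = v;

        double[] p = new double[y.Length];
        double sum = 0.0;
        for (int i = 0; i < y.Length; i++)
        {
            p[i] = Math.Exp((y[i] - max) / temperature);
            sum += p[i];
        }
        for (int i = 0; i < y.Length; i++)
            p[i] /= sum;
        return p;
    }
}
```

With `temperature = 1` this reduces to the ordinary softmax, so the existing layer is just the special case T = 1.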

  • Is the idea of temperature that it should increase as activity increases? It's kind of cold until it has "warmed up", maybe?
    – mathreadler
    Dec 17 '15 at 11:06

  • @mathreadler The idea behind temperature in softmax is to control the randomness of predictions: at high temperature the softmax outputs get closer to each other (the probabilities all become equal at T = ∞), while at low temperature "softmax" becomes more and more of a "hardmax" (the probability corresponding to the max input is ~1.0 and the others ~0.0 at T = 0). So I know how to use it when sampling values from my network. But the question is: should I train the network using values of T other than 1?
    – R. A.
    Dec 17 '15 at 11:18

  • My guess is that you should alter T as you train the network. It's cold before you get up to speed with the training and then gradually gets hotter as you train.
    – mathreadler
    Dec 17 '15 at 11:24

  • @mathreadler Network outputs would gradually become less and less meaningful, I'm afraid, because a high temperature erases the differences between them. The network would probably eventually learn to output values with very large differences to deal with this, so I wouldn't get controllable randomness at the end of training. So I think the correct approach is to train with T = 1 but sample with various T (depending on the output requirements). I just read the source code of a char-level RNN, and it seems that I am right.
    – R. A.
    Dec 17 '15 at 12:31
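
To make the "train at T = 1, sample at various T" recipe from these comments concrete, here is a hypothetical C# sampling helper built on the `SoftmaxWithTemperature` sketch above (again, the name and signature are mine):

```csharp
// Train the network with T = 1; use this helper only at generation time.
// Small temperatures push the draw toward the arg-max ("hardmax"),
// large temperatures approach uniform sampling.
public static int SampleIndex(double[] logits, double temperature, Random rng)
{
    double[] p = SoftmaxLayer.SoftmaxWithTemperature(logits, temperature);

    // Inverse-CDF draw from the categorical distribution p.
    double u = rng.NextDouble();
    double cumulative = 0.0;
    for (int i = 0; i < p.Length; i++)
    {
        cumulative += p[i];
        if (u <= cumulative) return i;
    }
    return p.Length - 1; // guard against floating-point round-off
}
```
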
2 Answers

Adding temperature into the softmax will change the probability distribution, i.e., it becomes softer when $T>1$. However, I suspect that SGD will learn this rescaling effect.
– Peixiang Zhong
answered Feb 27 '18 at 2:19, edited Mar 21 '18 at 0:29

  • I don't think this is true. It's more like $\mathrm{grad}_{\text{new}}(x)=\frac{1}{T}\,\mathrm{grad}_{\text{old}}(xT)$, not simply $\mathrm{grad}_{\text{new}}(x)=\frac{1}{T}\,\mathrm{grad}_{\text{old}}(x)$.
    – isarandi
    Mar 19 '18 at 18:40
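
To make the comment's point explicit, here is the chain rule under the question's convention of dividing the logits $y$ by $T$ (writing $L_T$ for the training loss seen through a temperature-$T$ softmax):
$$
L_T(y) = L_1\!\left(\frac{y}{T}\right)
\quad\Longrightarrow\quad
\nabla_y L_T(y) = \frac{1}{T}\,(\nabla L_1)\!\left(\frac{y}{T}\right).
$$
So the gradient is not merely rescaled by $1/T$, which a learning rate could absorb; it is also evaluated at the rescaled logits $y/T$. For cross-entropy with a one-hot target $t$, for instance, $\nabla_y L_T(y) = \frac{1}{T}\left(\operatorname{softmax}(y/T) - t\right)$.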

Even with T = 1 you have an implicit temperature, due to your choice of units in measuring y, or perhaps even a scaling parameter in generating y; you could normalize by dividing by the standard deviation. The choice of temperature during training may depend on the training set size. In any case, you should validate your training against an independent data set, and you may use that set to tune the temperature choice (and, philosophically, re-validate the chosen temperature on yet another set of data).
– dioid
answered May 26 '18 at 16:05
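
As a sketch of the normalization this answer suggests (the helper name is mine, and standardizing rather than only dividing by the standard deviation is my reading; subtracting the mean is harmless because softmax is shift-invariant):

```csharp
// Remove the implicit temperature set by the scale/units of the logits
// by standardizing them before the softmax. Subtracting the mean does
// not change the softmax output; dividing by the standard deviation
// fixes the effective temperature regardless of the units of y.
public static double[] NormalizeLogits(double[] y)
{
    double mean = 0.0;
    foreach (double v in y) mean += v;
    mean /= y.Length;

    double variance = 0.0;
    foreach (double v in y) variance += (v - mean) * (v - mean);
    double std = Math.Sqrt(variance / y.Length);

    double[] z = new double[y.Length];
    for (int i = 0; i < y.Length; i++)
        z[i] = (y[i] - mean) / (std + 1e-12); // epsilon guards against std == 0
    return z;
}
```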