How can I process multiple losses in PyTorch?
For example, I want to use some auxiliary losses to improve my model's performance. Which of the following patterns implements this correctly in PyTorch?
# one
loss1.backward()
loss2.backward()
loss3.backward()
optimizer.step()
# two
loss1.backward()
optimizer.step()
loss2.backward()
optimizer.step()
loss3.backward()
optimizer.step()
# three
loss = loss1 + loss2 + loss3
loss.backward()
optimizer.step()
Thanks for your answers!
Tags: python, pytorch
asked Jan 1 at 10:12 by heiheihei; edited Jan 2 at 9:21 by KonstantinosKokos
2 Answers
The first and third attempts are exactly the same and correct, while the second approach is wrong.
The reason is that in PyTorch, gradients are not "overwritten" by subsequent backward() calls; they are accumulated (summed). This makes the first and third approaches identical, though the first may be preferable if you have a low-memory GPU/RAM, since a batch of size 1024 with an immediate backward() + step() call is the same as having 8 batches of size 128 with 8 backward() calls and one step() call at the end.
To illustrate the idea, here is a simple example. We want to move our tensor x as close as possible to 40, 50 and 60 simultaneously:

x = torch.tensor([1.0], requires_grad=True)
loss1 = criterion(40, x)
loss2 = criterion(50, x)
loss3 = criterion(60, x)

Now the first approach (we use tensor.grad to read the current gradient of our tensor x):

loss1.backward()
loss2.backward()
loss3.backward()
print(x.grad)

This outputs: tensor([-294.])
The third approach:

loss = loss1 + loss2 + loss3
loss.backward()
print(x.grad)

Again the output is: tensor([-294.])
The second approach is different because we don't call opt.zero_grad() after each step() call, so the gradients from earlier backward() calls are still present at every subsequent step(). For example, if the three losses contribute gradients of 5, 1 and 4 to the same weight, instead of a total of 10 (= 5 + 1 + 4) your weight will receive 5*3 + 1*2 + 4*1 = 21 as its total applied gradient.
I do agree with the conclusion though: use the third approach if memory isn't an issue. For further reading: Link 1, Link 2
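For completeness, the example above can be run end to end. Here is a self-contained sketch; the squared-error criterion is an assumption (the original post never defines criterion), but any differentiable scalar loss accumulates the same way:

```python
import torch

# Stand-in loss: squared error. The original post never defines
# `criterion`, so this particular choice is an assumption; any
# differentiable scalar loss behaves the same way.
def criterion(target, pred):
    return (pred - target) ** 2

# Approach one: three separate backward() calls accumulate into x.grad.
x = torch.tensor([1.0], requires_grad=True)
for t in (40.0, 50.0, 60.0):
    criterion(torch.tensor([t]), x).backward()
grad_separate = x.grad.clone()

# Approach three: a single backward() on the summed loss.
x = torch.tensor([1.0], requires_grad=True)
loss = (criterion(torch.tensor([40.0]), x)
        + criterion(torch.tensor([50.0]), x)
        + criterion(torch.tensor([60.0]), x))
loss.backward()
grad_summed = x.grad.clone()

print(grad_separate, grad_summed)  # both tensor([-294.])
```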
I used the third method and it worked. Thank you for your patient and careful reply. – heiheihei, Jan 2 at 8:18
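The 5*3 + 1*2 + 4*1 = 21 arithmetic can also be checked numerically. This is a deliberately simplified sketch: the per-loss gradients 5, 1, 4 are set by hand rather than produced by backward(), and plain SGD with lr=1 is assumed so the weight moves by exactly the accumulated gradient at each step:

```python
import torch

# Simulate approach two without zero_grad(): each hand-set per-loss
# gradient (5, 1, 4 -- stand-ins, not real backward() results) is
# accumulated onto x.grad, then step() is taken immediately.
x = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=1.0)

for g in (5.0, 1.0, 4.0):
    if x.grad is None:
        x.grad = torch.tensor([g])
    else:
        x.grad += g  # gradients accumulate, just as backward() would
    opt.step()

# Total movement: 5 + (5+1) + (5+1+4) = 21, not 5 + 1 + 4 = 10.
print(x.item())  # -21.0
```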
-- Comment on first approach removed, see other answer --
Your second approach would require backpropagating with retain_graph=True, which incurs heavy computational cost. Moreover, it is wrong: you would have updated the network weights with the first optimizer step, and the next backward() call would then compute gradients with respect to the weights prior to the update, so the second step() call would inject noise into your updates. If, on the other hand, you performed another forward() call to backpropagate through the updated weights, you would end up with asynchronous optimization: the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).
Long story short: the way to go is the last approach. Reduce each loss to a scalar, sum the losses, and backpropagate the resulting loss. Side note: make sure your reduction scheme makes sense. For example, if you use reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective differs, so the relative weight contributed by each loss will also differ.
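A minimal sketch of the recommended pattern follows. Weighting the auxiliary loss before summing is a common refinement; the model, data shapes, and the 0.3 weight here are all made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: a tiny model with one main and one auxiliary
# objective. All shapes, targets, and the 0.3 weight are illustrative.
torch.manual_seed(0)
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
y_main = torch.randn(8, 2)
y_aux = torch.randn(8, 2)

opt.zero_grad()
pred = model(x)
loss_main = F.mse_loss(pred, y_main)   # reduction='mean' by default
loss_aux = F.mse_loss(pred, y_aux)
loss = loss_main + 0.3 * loss_aux      # reduce, weight, sum ...
loss.backward()                        # ... then a single backward()
opt.step()

print(loss.item())
```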
I believe this might have a few mistakes: link1, Link2. Please let me know if I've made any errors myself. – Shihab Shahriar, Jan 1 at 16:29
Good point. Thanks for noticing, and sorry for the wrong information; I was certain that this was the case for some reason. – KonstantinosKokos, Jan 1 at 18:09
Thank you for your patient and careful answer. It gives me some inspiration, but I'm dealing with an object detection problem, so I don't know what impact this will have on classification. Thanks again! – heiheihei, Jan 2 at 8:29
First answer: answered Jan 1 at 16:22 by Shihab Shahriar; edited Jan 3 at 8:57
Second answer: answered Jan 1 at 11:48 by KonstantinosKokos; edited Jan 1 at 18:15