How can i process multi loss in pytorch?












4















enter image description here



Such as this, I want to using some auxiliary loss to promoting my model performance.

Which type code can implement it in pytorch?



#one
loss1.backward()
loss2.backward()
loss3.backward()
optimizer.step()
#two
loss1.backward()
optimizer.step()
loss2.backward()
optimizer.step()
loss3.backward()
optimizer.step()
#three
loss = loss1+loss2+loss3
loss.backward()
optimizer.step()


Thanks for your answer!










share|improve this question





























    4















    enter image description here



    Such as this, I want to using some auxiliary loss to promoting my model performance.

    Which type code can implement it in pytorch?



    #one
    loss1.backward()
    loss2.backward()
    loss3.backward()
    optimizer.step()
    #two
    loss1.backward()
    optimizer.step()
    loss2.backward()
    optimizer.step()
    loss3.backward()
    optimizer.step()
    #three
    loss = loss1+loss2+loss3
    loss.backward()
    optimizer.step()


    Thanks for your answer!










    share|improve this question



























      4












      4








      4


      0






      enter image description here



      Such as this, I want to using some auxiliary loss to promoting my model performance.

      Which type code can implement it in pytorch?



      #one
      loss1.backward()
      loss2.backward()
      loss3.backward()
      optimizer.step()
      #two
      loss1.backward()
      optimizer.step()
      loss2.backward()
      optimizer.step()
      loss3.backward()
      optimizer.step()
      #three
      loss = loss1+loss2+loss3
      loss.backward()
      optimizer.step()


      Thanks for your answer!










      share|improve this question
















      enter image description here



      Such as this, I want to using some auxiliary loss to promoting my model performance.

      Which type code can implement it in pytorch?



      #one
      loss1.backward()
      loss2.backward()
      loss3.backward()
      optimizer.step()
      #two
      loss1.backward()
      optimizer.step()
      loss2.backward()
      optimizer.step()
      loss3.backward()
      optimizer.step()
      #three
      loss = loss1+loss2+loss3
      loss.backward()
      optimizer.step()


      Thanks for your answer!







      python pytorch






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 2 at 9:21









      KonstantinosKokos

      1,4471413




      1,4471413










      asked Jan 1 at 10:12









      heiheiheiheiheihei

      709




      709
























          2 Answers
          2






          active

          oldest

          votes


















          4














          First and 3rd attempt are exactly the same and correct, while 2nd approach is completely wrong.



          Reason is, in Pytorch, low layer gradients are Not "overwritten" by subsequent backward() calls, rather they are accumulated, or summed. This makes first and 3rd approach identical, though 1st approach might be preferable if you have low-memory GPU/RAM, since a batch size of 1024 with immediate backward() + step() call is same as having 8 batches of size 128 and 8 backward() calls, with one step() call in the end.



          To illustrate the idea, here is a simple example. We want to get our tensor x closest to [40,50,60] simultaneously:



          x = torch.tensor([1.0],requires_grad=True)
          loss1 = criterion(40,x)
          loss2 = criterion(50,x)
          loss3 = criterion(60,x)


          Now the first approach: (we use tensor.grad to get current gradient for our tensor x)



          loss1.backward()
          loss2.backward()
          loss3.backward()

          print(x.grad)


          This outputs : tensor([-294.])



          The third approach:



          loss = loss1+loss2+loss3
          loss.backward()
          print(x.grad)


          Again the output is : tensor([-294.])



          2nd approach is different because we don't call opt.zero_grad after calling step() method. This means in all 3 step calls gradients of first backward call is used. For example, if 3 losses provide gradients 5,1,4 for same weight, instead of having 10 (=5+1+4), now your weight will have 5*3+1*2+4*1=21 as gradient.



          I do agree with the conclusion though, use 3rd approach if memory isn't issue. For further reading : Link 1,Link 2






          share|improve this answer





















          • 1





            I used the third method,it's worked.Thank you for your patience and careful reply.

            – heiheihei
            Jan 2 at 8:18





















          5














          -- Comment on first approach removed, see other answer --



          Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong, since you would have updated the network weights with the first optimizer step, and then your next backward() call would compute the gradients prior to the update, which means that the second step() call would insert noise into your updates. If on the other hand you performed another forward() call to backpropagate through the updated weights, you would end up having an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).



          Long story short, the way to go is the last approach. Reduce each loss into a scalar, sum the losses and backpropagate the resulting loss. Side note; make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different)






          share|improve this answer





















          • 1





            I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

            – Shihab Shahriar
            Jan 1 at 16:29








          • 1





            Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

            – KonstantinosKokos
            Jan 1 at 18:09






          • 1





            Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

            – heiheihei
            Jan 2 at 8:29











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53994625%2fhow-can-i-process-multi-loss-in-pytorch%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          4














          First and 3rd attempt are exactly the same and correct, while 2nd approach is completely wrong.



          Reason is, in Pytorch, low layer gradients are Not "overwritten" by subsequent backward() calls, rather they are accumulated, or summed. This makes first and 3rd approach identical, though 1st approach might be preferable if you have low-memory GPU/RAM, since a batch size of 1024 with immediate backward() + step() call is same as having 8 batches of size 128 and 8 backward() calls, with one step() call in the end.



          To illustrate the idea, here is a simple example. We want to get our tensor x closest to [40,50,60] simultaneously:



          x = torch.tensor([1.0],requires_grad=True)
          loss1 = criterion(40,x)
          loss2 = criterion(50,x)
          loss3 = criterion(60,x)


          Now the first approach: (we use tensor.grad to get current gradient for our tensor x)



          loss1.backward()
          loss2.backward()
          loss3.backward()

          print(x.grad)


          This outputs : tensor([-294.])



          The third approach:



          loss = loss1+loss2+loss3
          loss.backward()
          print(x.grad)


          Again the output is : tensor([-294.])



          2nd approach is different because we don't call opt.zero_grad after calling step() method. This means in all 3 step calls gradients of first backward call is used. For example, if 3 losses provide gradients 5,1,4 for same weight, instead of having 10 (=5+1+4), now your weight will have 5*3+1*2+4*1=21 as gradient.



          I do agree with the conclusion though, use 3rd approach if memory isn't issue. For further reading : Link 1,Link 2






          share|improve this answer





















          • 1





            I used the third method,it's worked.Thank you for your patience and careful reply.

            – heiheihei
            Jan 2 at 8:18


















          4














          First and 3rd attempt are exactly the same and correct, while 2nd approach is completely wrong.



          Reason is, in Pytorch, low layer gradients are Not "overwritten" by subsequent backward() calls, rather they are accumulated, or summed. This makes first and 3rd approach identical, though 1st approach might be preferable if you have low-memory GPU/RAM, since a batch size of 1024 with immediate backward() + step() call is same as having 8 batches of size 128 and 8 backward() calls, with one step() call in the end.



          To illustrate the idea, here is a simple example. We want to get our tensor x closest to [40,50,60] simultaneously:



          x = torch.tensor([1.0],requires_grad=True)
          loss1 = criterion(40,x)
          loss2 = criterion(50,x)
          loss3 = criterion(60,x)


          Now the first approach: (we use tensor.grad to get current gradient for our tensor x)



          loss1.backward()
          loss2.backward()
          loss3.backward()

          print(x.grad)


          This outputs : tensor([-294.])



          The third approach:



          loss = loss1+loss2+loss3
          loss.backward()
          print(x.grad)


          Again the output is : tensor([-294.])



          2nd approach is different because we don't call opt.zero_grad after calling step() method. This means in all 3 step calls gradients of first backward call is used. For example, if 3 losses provide gradients 5,1,4 for same weight, instead of having 10 (=5+1+4), now your weight will have 5*3+1*2+4*1=21 as gradient.



          I do agree with the conclusion though, use 3rd approach if memory isn't issue. For further reading : Link 1,Link 2






          share|improve this answer





















          • 1





            I used the third method,it's worked.Thank you for your patience and careful reply.

            – heiheihei
            Jan 2 at 8:18
















          4












          4








          4







          First and 3rd attempt are exactly the same and correct, while 2nd approach is completely wrong.



          Reason is, in Pytorch, low layer gradients are Not "overwritten" by subsequent backward() calls, rather they are accumulated, or summed. This makes first and 3rd approach identical, though 1st approach might be preferable if you have low-memory GPU/RAM, since a batch size of 1024 with immediate backward() + step() call is same as having 8 batches of size 128 and 8 backward() calls, with one step() call in the end.



          To illustrate the idea, here is a simple example. We want to get our tensor x closest to [40,50,60] simultaneously:



          x = torch.tensor([1.0],requires_grad=True)
          loss1 = criterion(40,x)
          loss2 = criterion(50,x)
          loss3 = criterion(60,x)


          Now the first approach: (we use tensor.grad to get current gradient for our tensor x)



          loss1.backward()
          loss2.backward()
          loss3.backward()

          print(x.grad)


          This outputs : tensor([-294.])



          The third approach:



          loss = loss1+loss2+loss3
          loss.backward()
          print(x.grad)


          Again the output is : tensor([-294.])



          2nd approach is different because we don't call opt.zero_grad after calling step() method. This means in all 3 step calls gradients of first backward call is used. For example, if 3 losses provide gradients 5,1,4 for same weight, instead of having 10 (=5+1+4), now your weight will have 5*3+1*2+4*1=21 as gradient.



          I do agree with the conclusion though, use 3rd approach if memory isn't issue. For further reading : Link 1,Link 2






          share|improve this answer















          First and 3rd attempt are exactly the same and correct, while 2nd approach is completely wrong.



          Reason is, in Pytorch, low layer gradients are Not "overwritten" by subsequent backward() calls, rather they are accumulated, or summed. This makes first and 3rd approach identical, though 1st approach might be preferable if you have low-memory GPU/RAM, since a batch size of 1024 with immediate backward() + step() call is same as having 8 batches of size 128 and 8 backward() calls, with one step() call in the end.



          To illustrate the idea, here is a simple example. We want to get our tensor x closest to [40,50,60] simultaneously:



          x = torch.tensor([1.0],requires_grad=True)
          loss1 = criterion(40,x)
          loss2 = criterion(50,x)
          loss3 = criterion(60,x)


          Now the first approach: (we use tensor.grad to get current gradient for our tensor x)



          loss1.backward()
          loss2.backward()
          loss3.backward()

          print(x.grad)


          This outputs : tensor([-294.])



          The third approach:



          loss = loss1+loss2+loss3
          loss.backward()
          print(x.grad)


          Again the output is : tensor([-294.])



          2nd approach is different because we don't call opt.zero_grad after calling step() method. This means in all 3 step calls gradients of first backward call is used. For example, if 3 losses provide gradients 5,1,4 for same weight, instead of having 10 (=5+1+4), now your weight will have 5*3+1*2+4*1=21 as gradient.



          I do agree with the conclusion though, use 3rd approach if memory isn't issue. For further reading : Link 1,Link 2







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 3 at 8:57

























          answered Jan 1 at 16:22









          Shihab ShahriarShihab Shahriar

          1,322714




          1,322714








          • 1





            I used the third method,it's worked.Thank you for your patience and careful reply.

            – heiheihei
            Jan 2 at 8:18
















          • 1





            I used the third method,it's worked.Thank you for your patience and careful reply.

            – heiheihei
            Jan 2 at 8:18










          1




          1





          I used the third method,it's worked.Thank you for your patience and careful reply.

          – heiheihei
          Jan 2 at 8:18







          I used the third method,it's worked.Thank you for your patience and careful reply.

          – heiheihei
          Jan 2 at 8:18















          5














          -- Comment on first approach removed, see other answer --



          Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong, since you would have updated the network weights with the first optimizer step, and then your next backward() call would compute the gradients prior to the update, which means that the second step() call would insert noise into your updates. If on the other hand you performed another forward() call to backpropagate through the updated weights, you would end up having an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).



          Long story short, the way to go is the last approach. Reduce each loss into a scalar, sum the losses and backpropagate the resulting loss. Side note; make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different)






          share|improve this answer





















          • 1





            I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

            – Shihab Shahriar
            Jan 1 at 16:29








          • 1





            Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

            – KonstantinosKokos
            Jan 1 at 18:09






          • 1





            Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

            – heiheihei
            Jan 2 at 8:29
















          5














          -- Comment on first approach removed, see other answer --



          Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong, since you would have updated the network weights with the first optimizer step, and then your next backward() call would compute the gradients prior to the update, which means that the second step() call would insert noise into your updates. If on the other hand you performed another forward() call to backpropagate through the updated weights, you would end up having an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).



          Long story short, the way to go is the last approach. Reduce each loss into a scalar, sum the losses and backpropagate the resulting loss. Side note; make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different)






          share|improve this answer





















          • 1





            I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

            – Shihab Shahriar
            Jan 1 at 16:29








          • 1





            Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

            – KonstantinosKokos
            Jan 1 at 18:09






          • 1





            Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

            – heiheihei
            Jan 2 at 8:29














          5












          5








          5







          -- Comment on first approach removed, see other answer --



          Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong, since you would have updated the network weights with the first optimizer step, and then your next backward() call would compute the gradients prior to the update, which means that the second step() call would insert noise into your updates. If on the other hand you performed another forward() call to backpropagate through the updated weights, you would end up having an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).



          Long story short, the way to go is the last approach. Reduce each loss into a scalar, sum the losses and backpropagate the resulting loss. Side note; make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different)






          share|improve this answer















          -- Comment on first approach removed, see other answer --



          Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong, since you would have updated the network weights with the first optimizer step, and then your next backward() call would compute the gradients prior to the update, which means that the second step() call would insert noise into your updates. If on the other hand you performed another forward() call to backpropagate through the updated weights, you would end up having an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).



          Long story short, the way to go is the last approach. Reduce each loss into a scalar, sum the losses and backpropagate the resulting loss. Side note; make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different)







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 1 at 18:15

























          answered Jan 1 at 11:48









          KonstantinosKokosKonstantinosKokos

          1,4471413




          1,4471413








          • 1





            I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

            – Shihab Shahriar
            Jan 1 at 16:29








          • 1





            Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

            – KonstantinosKokos
            Jan 1 at 18:09






          • 1





            Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

            – heiheihei
            Jan 2 at 8:29














          • 1





            I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

            – Shihab Shahriar
            Jan 1 at 16:29








          • 1





            Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

            – KonstantinosKokos
            Jan 1 at 18:09






          • 1





            Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

            – heiheihei
            Jan 2 at 8:29








          1




          1





          I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

          – Shihab Shahriar
          Jan 1 at 16:29







          I believe this might have few mistakes: link1,Link2. Please let me know if I've myself made any error....

          – Shihab Shahriar
          Jan 1 at 16:29






          1




          1





          Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

          – KonstantinosKokos
          Jan 1 at 18:09





          Good point, thanks for noticing and sorry for the wrong information-- I was certain that this was the case for some reason.

          – KonstantinosKokos
          Jan 1 at 18:09




          1




          1





          Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

          – heiheihei
          Jan 2 at 8:29





          Thank you for your patience and careful answer.Your answer gives me some inspiration,but I'm dealing with objection detection problem,so I don't know what impact it will have on classification.Thanks again!

          – heiheihei
          Jan 2 at 8:29


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53994625%2fhow-can-i-process-multi-loss-in-pytorch%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Can a sorcerer learn a 5th-level spell early by creating spell slots using the Font of Magic feature?

          ts Property 'filter' does not exist on type '{}'

          mat-slide-toggle shouldn't change it's state when I click cancel in confirmation window