Derivation of Linear Regression using Normal Equations












I was going through Andrew Ng's ML course and have a question about one of the steps in the derivation of the linear regression solution via the normal equations.

Normal equation: $\theta=(X^TX)^{-1}X^TY$

While deriving it, there is this step:

$\frac{\partial}{\partial\theta}\theta^TX^TX\theta = X^TX\frac{\partial}{\partial\theta}\theta^T\theta$

But matrix multiplication isn't commutative, so how are we justified in taking $X^TX$ out of the derivative like that?
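
For concreteness, the normal equation itself can be sanity-checked numerically. Below is a minimal NumPy sketch (random data; the names `X`, `y`, `theta_normal` are purely illustrative) comparing the closed form $(X^TX)^{-1}X^TY$ against `np.linalg.lstsq`:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))   # design matrix: 100 samples, 3 features
    y = rng.normal(size=100)        # response vector

    # closed-form solution from the normal equation
    theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

    # reference solution from NumPy's least-squares solver
    theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(theta_normal, theta_lstsq))  # True

(In practice one would call `np.linalg.lstsq` or `np.linalg.solve` rather than forming the explicit inverse, but the formula above is what the derivation produces.)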










matrix-calculus linear-regression






asked Jan 13 at 19:13 by Rish1618





















2 Answers


















Given two symmetric matrices $A$ and $B$ (with $A$ invertible), consider the following scalar functions and their gradients:
$$\eqalign{
\alpha &= \theta^TA\theta &\implies \frac{\partial\alpha}{\partial\theta}=2A\theta \cr
\beta &= \theta^TB\theta &\implies \frac{\partial\beta}{\partial\theta}=2B\theta \cr
}$$

It's not terribly illuminating, but you can write the second gradient in terms of the first, i.e.
$$\frac{\partial\beta}{\partial\theta} = BA^{-1}\frac{\partial\alpha}{\partial\theta}$$
For the purposes of your question, $A=I$ and $B=X^TX$.






answered Jan 13 at 21:30 by greg
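
To see the gradient identity $\frac{\partial}{\partial\theta}\theta^TA\theta = 2A\theta$ for symmetric $A$ in action, here is a small finite-difference sketch with random data and illustrative names:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    M = rng.normal(size=(n, n))
    A = M.T @ M                      # symmetric, like X^T X
    theta = rng.normal(size=n)

    f = lambda t: t @ A @ t          # scalar function theta^T A theta

    # central finite differences, one unit direction per component
    eps = 1e-6
    grad_fd = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                        for e in np.eye(n)])

    print(np.allclose(grad_fd, 2 * A @ theta, atol=1e-6))  # True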

Although that equality is true, it does not give insight into why it is true.

There are many ways to compute that gradient, but here is a direct approach that simply computes all the partial derivatives individually.

Let $A$ be a symmetric matrix. (In your context, $A = X^\top X$.)
The partial derivative of $\theta^\top A \theta = \sum_i \sum_j A_{ij} \theta_i \theta_j$ with respect to $\theta_k$ is
$$\frac{\partial}{\partial \theta_k} \theta^\top A \theta = \sum_i \sum_j A_{ij} \frac{\partial}{\partial \theta_k}(\theta_i \theta_j) = A_{kk} \cdot 2 \theta_k + \sum_{i \ne k} A_{ik} \theta_i + \sum_{j \ne k} A_{kj} \theta_j = 2\sum_i A_{ki} \theta_i = 2 (A \theta)_k,$$
where combining the three terms into $2\sum_i A_{ki}\theta_i$ uses the symmetry $A_{ik} = A_{ki}$.
Stacking the partial derivatives into a vector gives you the gradient, so
$$\nabla_\theta \theta^\top A \theta = 2 A \theta.$$






answered Jan 13 at 19:29 by angryavian
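
The componentwise derivation can also be checked numerically for a single index $k$; the sketch below (random symmetric $A$, illustrative names) reproduces the three terms of the sum and compares their total to $2(A\theta)_k$:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    M = rng.normal(size=(n, n))
    A = (M + M.T) / 2               # an arbitrary symmetric matrix
    theta = rng.normal(size=n)

    k = 2
    # diagonal term plus the two "cross" sums, exactly as in the derivation
    partial_k = (2 * A[k, k] * theta[k]
                 + sum(A[i, k] * theta[i] for i in range(n) if i != k)
                 + sum(A[k, j] * theta[j] for j in range(n) if j != k))

    print(np.isclose(partial_k, 2 * (A @ theta)[k]))  # True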





























