derivative of cost function for Neural Network classifier
I am following Andrew Ng's Machine Learning course on Coursera.



The cost function, without regularization, used in the neural network course is

$J(\theta) = \frac{1}{m} \sum^{m}_{i=1}\sum^{K}_{k=1} \left[-y_{k}^{(i)}\log\left((h_{\theta}(x^{(i)}))_{k}\right) - (1-y_{k}^{(i)})\log\left(1-(h_{\theta}(x^{(i)}))_{k}\right)\right],$

where $m$ is the number of examples, $K$ is the number of classes, $J(\theta)$ is the cost function, $x^{(i)}$ is the $i$-th training example, $\theta$ are the weight matrices, and $h_{\theta}(x^{(i)})$ is the prediction of the neural network for the $i$-th training example.
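This cost can be sketched numerically. The following is a minimal NumPy version, assuming the predictions and one-hot labels are given as $m \times K$ arrays; the `cost` helper and the toy values are illustrative, not from the course:

```python
import numpy as np

def cost(H, Y):
    """Unregularized cross-entropy cost J(theta).

    H: (m, K) array of network outputs h_theta(x^(i))_k.
    Y: (m, K) array of one-hot labels y_k^(i).
    """
    m = H.shape[0]
    return (1.0 / m) * np.sum(-Y * np.log(H) - (1 - Y) * np.log(1 - H))

# Toy example with m = 2 examples and K = 3 classes.
H = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(cost(H, Y))
```

For $m = K = 1$ this reduces to the single-output formula below, e.g. `cost([[0.8]], [[1.0]])` equals $-\log(0.8)$.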



I understand intuitively that the backpropagation error associated with the last layer, $h$, is $h-y$. Nevertheless, I want to be able to prove this formally.



For simplicity, I considered $m = K = 1$:

$J(\theta) = -y \log(h_{\theta}) - (1-y) \log(1-h_{\theta}),$

and tried to prove this to myself on paper, but wasn't able to.



Neural Network Definition:



This neural network has 3 layers (1 input, 1 hidden, 1 output).



It uses the sigmoid activation function

$\sigma(z) = \frac{1}{1+e^{-z}}.$

The input is $x$.



Input layer: $a^{(1)} = x$ (add bias $a_{0}^{(1)}$).



Hidden layer: $z^{(2)} = \Theta^{(1)}a^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$ (add bias $a_{0}^{(2)}$).



Output layer: $z^{(3)} = \Theta^{(2)}a^{(2)}$, $a^{(3)} = \sigma(z^{(3)}) = h_{\theta}(x)$.
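A minimal NumPy sketch of this forward pass (the layer sizes and the random initialization are illustrative assumptions, not values from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 3))  # maps [1; x] (bias + 2 inputs) to z^(2)
Theta2 = rng.normal(size=(1, 4))  # maps [1; a^(2)] (bias + 3 hidden) to z^(3)

def forward(x):
    a1 = np.concatenate(([1.0], x))            # input layer with bias a_0^(1)
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # hidden layer with bias a_0^(2)
    z3 = Theta2 @ a2
    return sigmoid(z3)                         # a^(3) = h_theta(x)

h = forward(np.array([0.5, -1.2]))
```

Because the output unit is a sigmoid, `h` always lies strictly between 0 and 1, as the cross-entropy cost requires.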



During backpropagation, $\delta^{(3)}$ is the error associated with the output layer.



Questions:




  1. Why is it that

$\delta^{(3)} = h_{\theta} - y$?

  2. Shouldn't

$\delta^{(3)} = \frac{\partial J}{\partial h_{\theta}}$?

  • The notations are horrible. Let $z^{(i)} = h_\theta(x^{(i)})$ be the output for the $i$-th input $x^{(i)}$. What we want (to apply gradient descent) is $\frac{\partial J}{\partial \theta_{lj}}$, and for this we look at $\frac{\partial J}{\partial z^{(i)}_j}$ and $\frac{\partial z^{(i)}_j}{\partial \theta_{lj}}$. Also, you didn't define your neural network (i.e. $h_\theta$), only the objective function.
    – reuns, Jun 15 '17 at 16:07








  • You meant $$h_\theta(1-h_\theta)\,\frac{\partial J}{\partial h_{\theta}} = h_{\theta}-y$$
    – reuns, Jun 15 '17 at 16:18






  • This question is missing context or other details: Please improve the question by providing additional context, which ideally includes your thoughts on the problem and any attempts you have made to solve it. This information helps others identify where you have difficulties and helps them write answers appropriate to your experience level.
    – The Great Duck, Jun 15 '17 at 16:43






  • Related: link
    – user3658307, Jun 15 '17 at 17:01






  • @Roland why are you pinging me? That is an auto-generated post because I have flagged your question. Your question lacks context and that is for you to fix. Don't ask me to fix your "homework" question.
    – The Great Duck, Jun 15 '17 at 17:05

neural-networks

edited Jun 15 '17 at 16:57
asked Jun 15 '17 at 14:50
Roland

1 Answer

First, since your cost function is the binary cross-entropy error $\mathcal{H}$ with a sigmoid activation $\sigma$, you can see that
\begin{align}
\frac{\partial J}{\partial h_\theta}
&= \frac{1}{m}\sum_i\sum_k\frac{\partial}{\partial h_\theta}\mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{1}{m}\sum_i\sum_k \left[ \frac{-y_k^{(i)}}{h_\theta(x^{(i)})_k} + \frac{1-y_k^{(i)}}{1-h_\theta(x^{(i)})_k} \right] \\
&= \frac{1}{m}\sum_i\sum_k \frac{h_\theta(x^{(i)})_k - y_k^{(i)}}{h_\theta(x^{(i)})_k\,(1-h_\theta(x^{(i)})_k)}
\end{align}
Hence, for $m=K=1$, as a commenter notes,
$$ \frac{\partial J}{\partial h_\theta} = \frac{h_\theta - y}{h_\theta(1-h_\theta)}. $$
But this is not so useful, as it computes how the error changes as the final output changes. What you really want is how the cost changes as the weights $\theta^{(\ell)}_{ij}$ are varied, so you can do gradient descent on them.
An intermediate calculation is to compute the variation with respect to the pre-activation $z$, where $h_\theta=\sigma(z)$.
Let the last layer be $s$. Then the output layer error is
\begin{align}
\delta^{(s)}_j
&= \frac{\partial J}{\partial z_j^{(s)}}\\
&= \frac{1}{m}\sum_i\sum_k \frac{\partial}{\partial z_j^{(s)}} \mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{-1}{m}\sum_i\sum_k \left[ y_k^{(i)}\,\frac{1}{h_\theta(x^{(i)})_k} - (1-y_k^{(i)})\,\frac{1}{1-h_\theta(x^{(i)})_k} \right] \frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} \\
&= \frac{-1}{m}\sum_i\sum_k \left( [1-h_\theta(x^{(i)})_k]\,y_k^{(i)} - h_\theta(x^{(i)})_k\,[1-y_k^{(i)}] \right)\\
&= \frac{1}{m}\sum_i\sum_k \left( h_\theta(x^{(i)})_k - y_k^{(i)} \right)
\end{align}
using the fact that
$$
\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}}
= \sigma'(z_j^{(s)})
= \sigma(z_j^{(s)})\,[1-\sigma(z_j^{(s)})]
= h_\theta(x^{(i)})_k\,[1-h_\theta(x^{(i)})_k].
$$
So in the case that $m=K=1$ and $s=3$, we have
$$
\delta^{(3)} = h_\theta - y.
$$
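The final identity can also be checked numerically with a central finite difference in the $m=K=1$ case. This is a small sketch; the helper names (`J_of_z`) and the sample point are mine, not from the answer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_of_z(z3, y):
    # Cost J as a function of the output pre-activation z^(3), for m = K = 1.
    h = sigmoid(z3)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

z3, y = 0.7, 1.0
eps = 1e-6
# Central difference approximation of dJ/dz^(3).
numeric = (J_of_z(z3 + eps, y) - J_of_z(z3 - eps, y)) / (2 * eps)
analytic = sigmoid(z3) - y  # the claimed delta^(3) = h_theta - y
assert abs(numeric - analytic) < 1e-8
```

The agreement to within the finite-difference error confirms that differentiating $J$ with respect to $z^{(3)}$ (not $h_\theta$) gives $h_\theta - y$.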

answered Jun 15 '17 at 19:11
user3658307