Derivative of the cost function for a neural network classifier
I am following Andrew Ng's Machine Learning course on Coursera.
The cost function without regularization used in the neural network part of the course is:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K} \Big[-y_{k}^{(i)}\log\big((h_{\theta}(x^{(i)}))_{k}\big) - (1-y_{k}^{(i)})\log\big(1-(h_{\theta}(x^{(i)}))_{k}\big)\Big],$$
where $m$ is the number of training examples, $K$ is the number of classes, $J(\theta)$ is the cost, $x^{(i)}$ is the $i$-th training example, $\theta$ denotes the weight matrices, and $h_{\theta}(x^{(i)})$ is the network's prediction for the $i$-th training example.
I understand intuitively that the backpropagation error associated with the last layer is $h - y$. Nevertheless, I want to be able to prove this formally.
For simplicity, I considered $m = K = 1$:
$$J(\theta) = -y \log(h_{\theta}) - (1-y) \log(1-h_{\theta}),$$
and tried to prove the result to myself on paper, but wasn't able to.
Neural network definition:
The network has 3 layers (1 input, 1 hidden, 1 output).
It uses the sigmoid activation function
$\sigma(z) = \frac{1}{1+e^{-z}}$.
The input is $x$.
Input layer: $a^{(1)} = x$ (add bias $a_{0}^{(1)}$).
Hidden layer: $z^{(2)} = \Theta^{(1)}a^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$ (add bias $a_{0}^{(2)}$).
Output layer: $z^{(3)} = \Theta^{(2)}a^{(2)}$, $a^{(3)} = \sigma(z^{(3)}) = h_{\theta}(x)$.
During backpropagation, $\delta^{(3)}$ is the error associated with the output layer.
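For concreteness, here is a minimal NumPy sketch of this forward pass (the layer sizes, weight values, and function names are made up for illustration; this is not code from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 3))   # Theta^(1): maps [bias; a^(1)] (3,) -> z^(2) (3,)
Theta2 = rng.normal(size=(1, 4))   # Theta^(2): maps [bias; a^(2)] (4,) -> z^(3) (1,)

def forward(x):
    a1 = np.concatenate(([1.0], x))            # add bias a_0^(1)
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias a_0^(2)
    z3 = Theta2 @ a2
    h = sigmoid(z3)                            # a^(3) = h_theta(x)
    return h

print(forward(np.array([0.5, -1.2])))          # prediction in (0, 1)
```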
Question:
- Why is it that $\delta^{(3)} = h_{\theta} - y$?
- Shouldn't it instead be $\delta^{(3)} = \frac{\partial J}{\partial h_{\theta}}$?
neural-networks
asked Jun 15 '17 at 14:50, edited Jun 15 '17 at 16:57
Roland
The notation is horrible. Let $z^{(i)} = h_\theta(x^{(i)})$ be the output for the $i$-th input $x^{(i)}$. What we want (to apply gradient descent) is $\frac{\partial J}{\partial \theta_{lj}}$, and for this we look at $\frac{\partial J}{\partial z^{(i)}_j}$ and $\frac{\partial z^{(i)}_j}{\partial \theta_{lj}}$. Also, you didn't define your neural network (i.e. $h_\theta$), only the objective function.
– reuns
Jun 15 '17 at 16:07
You meant $$h_\theta(1-h_\theta) \frac{\partial J}{\partial h_{\theta}} = h_{\theta}-y$$
– reuns
Jun 15 '17 at 16:18
This question is missing context or other details: Please improve the question by providing additional context, which ideally includes your thoughts on the problem and any attempts you have made to solve it. This information helps others identify where you have difficulties and helps them write answers appropriate to your experience level.
– The Great Duck
Jun 15 '17 at 16:43
Related: link
– user3658307
Jun 15 '17 at 17:01
@Roland why are you pinging me? That is an auto-generated post because I have flagged your question. Your question lacks context and that is for you to fix. Don't ask me to fix your "homework" question.
– The Great Duck
Jun 15 '17 at 17:05
1 Answer
First, since your cost function is using the binary cross-entropy error $\mathcal{H}$ with a sigmoid activation $\sigma$, you can see that:
\begin{align}
\frac{\partial J}{\partial h_\theta}
&= \frac{1}{m}\sum_i\sum_k\frac{\partial }{\partial h_\theta}\mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{1}{m}\sum_i\sum_k \left[ \frac{-y_k^{(i)}}{h_\theta(x^{(i)})_k} + \frac{1-y_k^{(i)}}{1-h_\theta(x^{(i)})_k} \right] \\
&= \frac{1}{m}\sum_i\sum_k \frac{h_\theta(x^{(i)})_k - y_k^{(i)}}{ h_\theta(x^{(i)})_k\,(1-h_\theta(x^{(i)})_k) }
\end{align}
Hence, for $m=K=1$, as a commenter notes,
$$ \frac{\partial J}{\partial h_\theta} = \frac{h_\theta - y}{ h_\theta(1-h_\theta) }. $$
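As a quick numerical sanity check of this closed form (not part of the original derivation; the values and helper names are purely illustrative), one can compare it against a finite-difference approximation of $\partial J/\partial h_\theta$ for $m=K=1$:

```python
import numpy as np

def J(h, y):
    # Binary cross-entropy for a single example and class (m = K = 1).
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h, y, eps = 0.73, 1.0, 1e-6
numeric = (J(h + eps, y) - J(h - eps, y)) / (2 * eps)   # finite difference
analytic = (h - y) / (h * (1 - h))                      # derived formula
print(numeric, analytic)                                # both approximately -1.3699
```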
But this is not so useful on its own, because it only tells you how the cost changes as the final output changes. What you really want is how the cost changes as the weights $\theta^{(\ell)}_{ij}$ are varied, so that you can do gradient descent on them.
A useful intermediate step is to compute the variation with respect to the pre-activation $z^{(s)}$, where $h_\theta = \sigma(z^{(s)})$.
Let the last layer be $s$. Then the output layer error is:
\begin{align}
\delta^{(s)}_j
&= \frac{\partial J}{\partial z_j^{(s)}}\\
&= \frac{1}{m}\sum_i\sum_k \frac{\partial }{\partial z_j^{(s)}} \mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{-1}{m}\sum_i\sum_k \left[ y_k^{(i)} \frac{1}{h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} - (1-y_k^{(i)})\frac{1}{1-h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} \right] \\
&= \frac{-1}{m}\sum_i\sum_k \Big( [1-h_\theta(x^{(i)})_k]\,y_k^{(i)} - h_\theta(x^{(i)})_k\,[1-y_k^{(i)}] \Big)\\
&= \frac{1}{m}\sum_i\sum_k \Big( h_\theta(x^{(i)})_k - y_k^{(i)} \Big)
\end{align}
using the fact that
$$
\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}}
= \sigma'(z_j^{(s)})
= \sigma(z_j^{(s)})\,[1-\sigma(z_j^{(s)})]
= h_\theta(x^{(i)})_k\,[1-h_\theta(x^{(i)})_k]
$$
(for a sigmoid output layer this derivative is nonzero only when $k=j$; with $K=1$ there is a single output unit, so the sum over $k$ collapses to that term).
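The sigmoid-derivative identity itself is easy to verify numerically; a tiny illustrative sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.4, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
identity = sigmoid(z) * (1 - sigmoid(z))                     # sigma'(z) = sigma(z)(1 - sigma(z))
print(numeric, identity)                                     # both approximately 0.2403
```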
So in the case that $m=K=1$ and $s=3$, we have:
$$
\delta^{(3)} = h_\theta - y.
$$
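A minimal numerical check of this conclusion (assuming a single example, a single sigmoid output unit, and an arbitrary pre-activation value; the names are illustrative) compares a finite-difference estimate of $\partial J/\partial z^{(3)}$ with $h_\theta - y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_z3(z3, y):
    # J as a function of the output pre-activation z^(3), with m = K = 1.
    h = sigmoid(z3)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

z3, y, eps = 0.4, 1.0, 1e-6
h = sigmoid(z3)

numeric_delta3 = (cost_from_z3(z3 + eps, y) - cost_from_z3(z3 - eps, y)) / (2 * eps)
analytic_delta3 = h - y

print(numeric_delta3, analytic_delta3)  # both approximately -0.4013
```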
answered Jun 15 '17 at 19:11
user3658307