Derivative of the cost function for a neural network classifier
I am following Andrew Ng's Machine Learning course on Coursera.
The cost function without regularization used in the neural network part of the course is:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K} \Big[-y_{k}^{(i)}\log\big((h_{\theta}(x^{(i)}))_{k}\big) - (1-y_{k}^{(i)})\log\big(1-(h_{\theta}(x^{(i)}))_{k}\big)\Big],$$
where $m$ is the number of training examples, $K$ is the number of classes, $J(\theta)$ is the cost, $x^{(i)}$ is the $i$-th training example, $\theta$ denotes the weight matrices, and $h_{\theta}(x^{(i)})$ is the network's prediction for the $i$-th training example.
I understand intuitively that the backpropagation error associated with the last layer is $h - y$. Nevertheless, I want to be able to prove this formally.
For simplicity, I considered $m = K = 1$:
$$J(\theta) = -y \log(h_{\theta}) - (1-y) \log(1-h_{\theta}),$$
and tried to prove the result to myself on paper, but wasn't able to.
Neural network definition:
The network has 3 layers (1 input, 1 hidden, 1 output).
It uses the sigmoid activation function
$\sigma(z) = \frac{1}{1+e^{-z}}$.
The input is $x$.
Input layer: $a^{(1)} = x$ (add bias $a_{0}^{(1)}$).
Hidden layer: $z^{(2)} = \Theta^{(1)}a^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$ (add bias $a_{0}^{(2)}$).
Output layer: $z^{(3)} = \Theta^{(2)}a^{(2)}$, $a^{(3)} = \sigma(z^{(3)}) = h_{\theta}(x)$.
During backpropagation, $\delta^{(3)}$ is the error associated with the output layer.
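For concreteness, here is a minimal NumPy sketch of this forward pass (the layer sizes, weight values, and function names are made up for illustration; this is not code from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 3))   # Theta^(1): maps [bias; a^(1)] (3,) -> z^(2) (3,)
Theta2 = rng.normal(size=(1, 4))   # Theta^(2): maps [bias; a^(2)] (4,) -> z^(3) (1,)

def forward(x):
    a1 = np.concatenate(([1.0], x))            # add bias a_0^(1)
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias a_0^(2)
    z3 = Theta2 @ a2
    h = sigmoid(z3)                            # a^(3) = h_theta(x)
    return h

print(forward(np.array([0.5, -1.2])))          # prediction in (0, 1)
```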
Question:
- Why is it that $\delta^{(3)} = h_{\theta} - y$?
- Shouldn't it instead be $\delta^{(3)} = \frac{\partial J}{\partial h_{\theta}}$?
neural-networks
asked Jun 15 '17 at 14:50, edited Jun 15 '17 at 16:57
Roland
The notation is horrible. Let $z^{(i)} = h_\theta(x^{(i)})$ be the output for the $i$-th input $x^{(i)}$. What we want (to apply gradient descent) is $\frac{\partial J}{\partial \theta_{lj}}$, and for this we look at $\frac{\partial J}{\partial z^{(i)}_j}$ and $\frac{\partial z^{(i)}_j}{\partial \theta_{lj}}$. Also, you didn't define your neural network (i.e. $h_\theta$), only the objective function.
– reuns
Jun 15 '17 at 16:07
You meant $$h_\theta(1-h_\theta) \frac{\partial J}{\partial h_{\theta}} = h_{\theta}-y$$
– reuns
Jun 15 '17 at 16:18
This question is missing context or other details: Please improve the question by providing additional context, which ideally includes your thoughts on the problem and any attempts you have made to solve it. This information helps others identify where you have difficulties and helps them write answers appropriate to your experience level.
– The Great Duck
Jun 15 '17 at 16:43
Related: link
– user3658307
Jun 15 '17 at 17:01
@Roland why are you pinging me? That is an auto-generated post because I have flagged your question. Your question lacks context and that is for you to fix. Don't ask me to fix your "homework" question.
– The Great Duck
Jun 15 '17 at 17:05
1 Answer
First, since your cost function is using the binary cross-entropy error $\mathcal{H}$ with a sigmoid activation $\sigma$, you can see that:
\begin{align}
\frac{\partial J}{\partial h_\theta}
&= \frac{1}{m}\sum_i\sum_k\frac{\partial }{\partial h_\theta}\mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{1}{m}\sum_i\sum_k \left[ \frac{-y_k^{(i)}}{h_\theta(x^{(i)})_k} + \frac{1-y_k^{(i)}}{1-h_\theta(x^{(i)})_k} \right] \\
&= \frac{1}{m}\sum_i\sum_k \frac{h_\theta(x^{(i)})_k - y_k^{(i)}}{ h_\theta(x^{(i)})_k\,(1-h_\theta(x^{(i)})_k) }
\end{align}
Hence, for $m=K=1$, as a commenter notes,
$$ \frac{\partial J}{\partial h_\theta} = \frac{h_\theta - y}{ h_\theta(1-h_\theta) }. $$
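As a quick numerical sanity check of this closed form (not part of the original derivation; the values and helper names are purely illustrative), one can compare it against a finite-difference approximation of $\partial J/\partial h_\theta$ for $m=K=1$:

```python
import numpy as np

def J(h, y):
    # Binary cross-entropy for a single example and class (m = K = 1).
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h, y, eps = 0.73, 1.0, 1e-6
numeric = (J(h + eps, y) - J(h - eps, y)) / (2 * eps)   # finite difference
analytic = (h - y) / (h * (1 - h))                      # derived formula
print(numeric, analytic)                                # both approximately -1.3699
```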
But this is not so useful on its own, because it only tells you how the cost changes as the final output changes. What you really want is how the cost changes as the weights $\theta^{(\ell)}_{ij}$ are varied, so that you can do gradient descent on them.
A useful intermediate step is to compute the variation with respect to the pre-activation $z^{(s)}$, where $h_\theta = \sigma(z^{(s)})$.
Let the last layer be $s$. Then the output layer error is:
\begin{align}
\delta^{(s)}_j
&= \frac{\partial J}{\partial z_j^{(s)}}\\
&= \frac{1}{m}\sum_i\sum_k \frac{\partial }{\partial z_j^{(s)}} \mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{-1}{m}\sum_i\sum_k \left[ y_k^{(i)} \frac{1}{h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} - (1-y_k^{(i)})\frac{1}{1-h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} \right] \\
&= \frac{-1}{m}\sum_i\sum_k \Big( [1-h_\theta(x^{(i)})_k]\,y_k^{(i)} - h_\theta(x^{(i)})_k\,[1-y_k^{(i)}] \Big)\\
&= \frac{1}{m}\sum_i\sum_k \Big( h_\theta(x^{(i)})_k - y_k^{(i)} \Big)
\end{align}
using the fact that
$$
\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}}
= \sigma'(z_j^{(s)})
= \sigma(z_j^{(s)})\,[1-\sigma(z_j^{(s)})]
= h_\theta(x^{(i)})_k\,[1-h_\theta(x^{(i)})_k]
$$
(for a sigmoid output layer this derivative is nonzero only when $k=j$; with $K=1$ there is a single output unit, so the sum over $k$ collapses to that term).
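The sigmoid-derivative identity itself is easy to verify numerically; a tiny illustrative sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.4, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
identity = sigmoid(z) * (1 - sigmoid(z))                     # sigma'(z) = sigma(z)(1 - sigma(z))
print(numeric, identity)                                     # both approximately 0.2403
```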
So in the case that $m=K=1$ and $s=3$, we have:
$$
\delta^{(3)} = h_\theta - y.
$$
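A minimal numerical check of this conclusion (assuming a single example, a single sigmoid output unit, and an arbitrary pre-activation value; the names are illustrative) compares a finite-difference estimate of $\partial J/\partial z^{(3)}$ with $h_\theta - y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_z3(z3, y):
    # J as a function of the output pre-activation z^(3), with m = K = 1.
    h = sigmoid(z3)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

z3, y, eps = 0.4, 1.0, 1e-6
h = sigmoid(z3)

numeric_delta3 = (cost_from_z3(z3 + eps, y) - cost_from_z3(z3 - eps, y)) / (2 * eps)
analytic_delta3 = h - y

print(numeric_delta3, analytic_delta3)  # both approximately -0.4013
```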
answered Jun 15 '17 at 19:11
user3658307