Derivative of an Elementwise Function (acting on a vector)
I have seen an example (in the context of neural network backpropagation) that I don't understand.
Given:
- $\textbf{a} = \textbf{x}\textbf{W}_{1}+\textbf{b}_{1}$, where $\textbf{x}$ is $(1\times 5)$, $\textbf{W}_1$ is $(5\times 3)$ and $\textbf{b}_1$ is $(1\times 3)$
- $\textbf{h}=\sigma(\textbf{a})$, where $\sigma$ is the sigmoid function $\sigma(a_i)=\frac{1}{1+\exp(-a_{i})}$ acting on the $n$-dimensional vector $\textbf{a}$ element-wise, i.e. $\sigma(\textbf{a}) =[\sigma(a_{1}),\sigma(a_{2}),\dots,\sigma(a_{n})]$
- $\theta = \textbf{h}\textbf{W}_{2}+\textbf{b}_2$, where $\textbf{h}$ is $(1\times 3)$, $\textbf{W}_2$ is $(3\times 5)$ and $\textbf{b}_2$ is $(1\times 5)$
- $\hat{\textbf{y}}=\operatorname{softmax}(\theta)$, where $\hat{\textbf{y}}$ is $(1\times 5)$
- $L=\operatorname{xent}(\textbf{y}, \hat{\textbf{y}})$, the cross-entropy loss
The derivative of interest is $\frac{\partial L}{\partial \textbf{x}}$, which by the chain rule is
$$\frac{\partial L}{\partial \textbf{x}} =\frac{\partial L}{\partial \hat{\textbf{y}}}\frac{\partial \hat{\textbf{y}}}{\partial \theta}\frac{\partial \theta}{\partial \textbf{h}}\frac{\partial \textbf{h}}{\partial \textbf{a}}\frac{\partial \textbf{a}}{\partial \textbf{x}}$$
The result they show makes almost perfect sense to me:
$$\big((\hat{\textbf{y}}-\textbf{y})\,\textbf{W}_{2}^{T}\big)\circ\sigma'(\textbf{a})\,\textbf{W}_{1}$$
My questions:
- Since $(\hat{\textbf{y}}-\textbf{y})$ is $(1\times 5)$, they transpose $\textbf{W}_{2}$ so that the vector-matrix multiplication conforms. Is this OK? Can you just transpose a matrix whenever you want?
- Why the element-wise multiplication by the derivative of $\sigma(\textbf{a})$? The rationale given is that since $\sigma$ is an element-wise operator, this is the proper form. I don't understand why you would not apply $\sigma'$ to each element of $\textbf{a}$ and then matrix-multiply that result against the vector on the left.
calculus linear-algebra derivatives vector-spaces
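For concreteness, here is a minimal NumPy sketch of the same forward pass and of the backward pass in the row-vector layout used above; the weights, the one-hot label, and the `sigmoid`/`softmax` helpers are arbitrary stand-ins, not taken from the original example. It shows where the shapes force the transposes and where the element-wise sigmoid derivative enters as a Hadamard product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example values with the stated shapes (assumptions for illustration only)
x = rng.normal(size=(1, 5))
W1 = rng.normal(size=(5, 3)); b1 = rng.normal(size=(1, 3))
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=(1, 5))
y = np.eye(5)[[2]]                       # one-hot true label, shape (1, 5)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Forward pass, exactly as defined above
a = x @ W1 + b1                          # (1, 3)
h = sigmoid(a)                           # (1, 3), sigma applied element-wise
theta = h @ W2 + b2                      # (1, 5)
y_hat = softmax(theta)                   # (1, 5)
L = -np.sum(y * np.log(y_hat))           # cross entropy

# Backward pass in the same row-vector layout
dL_dtheta = y_hat - y                    # (1, 5)
dL_dh = dL_dtheta @ W2.T                 # (1, 3): W2 transposed so the shapes conform
dL_da = dL_dh * (h * (1 - h))            # (1, 3): element-wise product with sigma'(a)
dL_dx = dL_da @ W1.T                     # (1, 5): same shape as x
print(L, dL_dx.shape)
```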
I tried to reformat the equations a little. The solution has a ${\bf y}$ which does not appear in the formulas before; what is that? Also, is the $\log$ to be taken element-wise? And what is $\circ$?
– leonbloy, Apr 8 '16 at 23:26
Hi. $y$ is the actual label; it comes from the definitions of softmax and cross entropy. If you fully work out the derivative of $-\log(\hat{y})$ with respect to $\theta$, it equals $(\hat{y} - y)$. That part is not really relevant here (it was essentially given as part of a previous derivation); I only included it because it leads into the transpose of $W_2$. The circle symbol $\circ$ denotes element-wise multiplication.
– B_Miner, Apr 9 '16 at 1:19
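That identity is easy to check numerically. A minimal sketch (the logits, the one-hot label, and the `softmax`/`xent` helpers below are arbitrary choices for illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(1, 5))   # arbitrary logits
y = np.eye(5)[[3]]                # arbitrary one-hot label, shape (1, 5)

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def xent(t):
    return -np.sum(y * np.log(softmax(t)))

analytic = softmax(theta) - y     # the claimed derivative dL/dtheta = y_hat - y

# Central finite differences, one component of theta at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[1]):
    e = np.zeros_like(theta)
    e[0, i] = eps
    numeric[0, i] = (xent(theta + e) - xent(theta - e)) / (2 * eps)

print(np.allclose(analytic, numeric))   # True
```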
Thank you for reformatting; my LaTeX skills are poor. It took me an hour and I couldn't get the partials right :)
– B_Miner, Apr 9 '16 at 1:23
1 Answer
Allow me to restate the problem in terms of column vectors instead of row vectors:
$$\eqalign{
a &= W_1^Tx + b_1 &\implies da = W_1^T\,dx \cr
h &= \sigma(a) &\implies dh = (H-H^2)\,da, \quad &H={\rm Diag}(h) \cr
\theta &= W_2^Th + b_2 &\implies d\theta = W_2^T\,dh \cr
y &= {\rm softmax}(\theta) &\implies dy = (Y-yy^T)\,d\theta, \quad &Y={\rm Diag}(y) \cr
L &= -p:\log y &\implies (p,y) \doteq (y,{\hat y}) \cr
}$$
Find the differential of the final (cross-entropy) term, and then its gradient:
$$\eqalign{
dL &= -p:Y^{-1}dy \cr
&= -p:Y^{-1}(Y-yy^T)\,d\theta \cr
&= -p:(I-1y^T)\,d\theta \cr
&= (y1^T-I)\,p:d\theta \cr
&= (y-p):W_2^T\,dh \cr
&= W_2(y-p):(H-H^2)\,da \cr
&= (H-H^2)\,W_2(y-p):W_1^T\,dx \cr
&= W_1(H-H^2)\,W_2(y-p):dx \cr
\frac{\partial L}{\partial x} &= W_1(H-H^2)\,W_2(y-p) \cr
}$$
In some of these steps, I used a colon to denote the trace/Frobenius product
$$A:B = {\rm tr}(A^TB)$$
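As a quick illustration of that product (the example matrices below are arbitrary), it is simply the sum of the element-wise products of the two matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(3, 5))

frobenius = np.trace(A.T @ B)      # A : B = tr(A^T B)
elementwise = np.sum(A * B)        # ... equals the sum of element-wise products

print(np.isclose(frobenius, elementwise))   # True
```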
Casting the final result back into your preferred notation of row vectors, hats, and Hadamard products yields
$$\eqalign{
\frac{\partial L}{\partial x}
&= \Big(\big(({\hat y}-y)W_2^T\big)\circ(h-h\circ h)\Big)W_1^T \cr
}$$
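A finite-difference sanity check of this result in both forms, as a minimal NumPy sketch; the dimensions, weights, and one-hot target are arbitrary assumptions, and the symbols follow the column-vector notation above ($p$ is the label, $y$ the softmax output):

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary shapes and weights in the column-vector convention (pure assumptions)
W1 = rng.normal(size=(5, 3)); b1 = rng.normal(size=(3,))
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=(5,))
p = np.eye(5)[1]                          # true one-hot label
x = rng.normal(size=(5,))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def loss(x):
    h = sigmoid(W1.T @ x + b1)
    y = softmax(W2.T @ h + b2)
    return -p @ np.log(y)                 # L = -p : log(y)

# Quantities needed for the two closed forms
h = sigmoid(W1.T @ x + b1)
y = softmax(W2.T @ h + b2)
H = np.diag(h)

grad_col = W1 @ (H - H @ H) @ W2 @ (y - p)            # W1 (H - H^2) W2 (y - p)
grad_row = (((y - p) @ W2.T) * (h - h * h)) @ W1.T    # ((y_hat - y) W2^T o (h - h o h)) W1^T

# Central finite differences for comparison
eps = 1e-6
grad_num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                     for e in np.eye(5)])

print(np.allclose(grad_col, grad_num, atol=1e-6),
      np.allclose(grad_row, grad_num, atol=1e-6))     # True True
```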