Derivative of an Elementwise Function (acting on a vector)
I have seen an example (in the context of neural network backpropagation) that I don't understand.
Given:
- $\textbf{a} = \textbf{x}\textbf{W}_{1}+\textbf{b}_{1}$, where $\textbf{x}$ is $(1\times 5)$, $\textbf{W}_1$ is $(5\times 3)$ and $\textbf{b}_1$ is $(1\times 3)$
- $\textbf{h}=\sigma(\textbf{a})$, where $\sigma$ is the sigmoid function $\sigma(a_i)=\frac{1}{1+\exp(-a_{i})}$ acting on the $n$-dimensional vector $\textbf{a}$ element-wise, i.e. $\sigma(\textbf{a}) =[\sigma(a_{1}),\sigma(a_{2}),\dots,\sigma(a_{n})]$
- $\theta = \textbf{h}\textbf{W}_{2}+\textbf{b}_2$, where $\textbf{h}$ is $(1\times 3)$, $\textbf{W}_2$ is $(3\times 5)$ and $\textbf{b}_2$ is $(1\times 5)$
- $\hat{\textbf{y}}=\operatorname{softmax}(\theta)$, where $\hat{\textbf{y}}$ is $(1\times 5)$
- $L=\operatorname{xent}(\textbf{y}, \hat{\textbf{y}})$, the cross-entropy loss
The derivative of interest is $\frac{\partial L}{\partial \textbf{x}}$, which by the chain rule is
$$\frac{\partial L}{\partial \textbf{x}} =\frac{\partial L}{\partial \hat{\textbf{y}}}\frac{\partial \hat{\textbf{y}}}{\partial \theta}\frac{\partial \theta}{\partial \textbf{h}}\frac{\partial \textbf{h}}{\partial \textbf{a}}\frac{\partial \textbf{a}}{\partial \textbf{x}}$$
The result they show makes almost perfect sense to me:
$$\big((\hat{\textbf{y}}-\textbf{y})\,\textbf{W}_{2}^{T}\big)\circ\sigma'(\textbf{a})\,\textbf{W}_{1}$$
My questions:
- Since $(\hat{\textbf{y}}-\textbf{y})$ is $(1\times 5)$, they transpose $\textbf{W}_{2}$ so that the vector-matrix multiplication conforms. Is this OK? Can you just transpose a matrix whenever you want?
- Why the element-wise multiplication by the derivative of $\sigma(\textbf{a})$? The rationale given is that since $\sigma$ is an element-wise operator, this is the proper form. I don't understand why you would not apply $\sigma'$ to each element of $\textbf{a}$ and then matrix-multiply that result against the vector on the left.
calculus linear-algebra derivatives vector-spaces
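For concreteness, here is a minimal NumPy sketch of the same forward pass and of the backward pass in the row-vector layout used above; the weights, the one-hot label, and the `sigmoid`/`softmax` helpers are arbitrary stand-ins, not taken from the original example. It shows where the shapes force the transposes and where the element-wise sigmoid derivative enters as a Hadamard product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example values with the stated shapes (assumptions for illustration only)
x = rng.normal(size=(1, 5))
W1 = rng.normal(size=(5, 3)); b1 = rng.normal(size=(1, 3))
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=(1, 5))
y = np.eye(5)[[2]]                       # one-hot true label, shape (1, 5)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Forward pass, exactly as defined above
a = x @ W1 + b1                          # (1, 3)
h = sigmoid(a)                           # (1, 3), sigma applied element-wise
theta = h @ W2 + b2                      # (1, 5)
y_hat = softmax(theta)                   # (1, 5)
L = -np.sum(y * np.log(y_hat))           # cross entropy

# Backward pass in the same row-vector layout
dL_dtheta = y_hat - y                    # (1, 5)
dL_dh = dL_dtheta @ W2.T                 # (1, 3): W2 transposed so the shapes conform
dL_da = dL_dh * (h * (1 - h))            # (1, 3): element-wise product with sigma'(a)
dL_dx = dL_da @ W1.T                     # (1, 5): same shape as x
print(L, dL_dx.shape)
```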
I tried to reformat the equations a little. The solution has a ${\bf y}$ which does not appear in the formulas before; what is that? Also, is the $\log$ to be taken element-wise? And what is $\circ$?
– leonbloy, Apr 8 '16 at 23:26
Hi. $y$ is the actual label; it comes from the definitions of softmax and cross entropy. If you fully work out the derivative of $-\log(\hat{y})$ with respect to $\theta$, it equals $(\hat{y} - y)$. That part is not really relevant here (it was essentially given as part of a previous derivation); I only included it because it leads into the transpose of $W_2$. The circle symbol $\circ$ denotes element-wise multiplication.
– B_Miner, Apr 9 '16 at 1:19
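That identity is easy to check numerically. A minimal sketch (the logits, the one-hot label, and the `softmax`/`xent` helpers below are arbitrary choices for illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(1, 5))   # arbitrary logits
y = np.eye(5)[[3]]                # arbitrary one-hot label, shape (1, 5)

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def xent(t):
    return -np.sum(y * np.log(softmax(t)))

analytic = softmax(theta) - y     # the claimed derivative dL/dtheta = y_hat - y

# Central finite differences, one component of theta at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[1]):
    e = np.zeros_like(theta)
    e[0, i] = eps
    numeric[0, i] = (xent(theta + e) - xent(theta - e)) / (2 * eps)

print(np.allclose(analytic, numeric))   # True
```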
Thank you for reformatting; my LaTeX skills are poor. It took me an hour and I couldn't get the partials right :)
– B_Miner, Apr 9 '16 at 1:23
1 Answer
Allow me to restate the problem in terms of column vectors instead of row vectors:
$$\eqalign{
a &= W_1^Tx + b_1 &\implies da = W_1^T\,dx \cr
h &= \sigma(a) &\implies dh = (H-H^2)\,da, \quad &H={\rm Diag}(h) \cr
\theta &= W_2^Th + b_2 &\implies d\theta = W_2^T\,dh \cr
y &= {\rm softmax}(\theta) &\implies dy = (Y-yy^T)\,d\theta, \quad &Y={\rm Diag}(y) \cr
L &= -p:\log y &\implies (p,y) \doteq (y,{\hat y}) \cr
}$$
Find the differential of the final (cross-entropy) term, and then its gradient:
$$\eqalign{
dL &= -p:Y^{-1}dy \cr
&= -p:Y^{-1}(Y-yy^T)\,d\theta \cr
&= -p:(I-1y^T)\,d\theta \cr
&= (y1^T-I)\,p:d\theta \cr
&= (y-p):W_2^T\,dh \cr
&= W_2(y-p):(H-H^2)\,da \cr
&= (H-H^2)\,W_2(y-p):W_1^T\,dx \cr
&= W_1(H-H^2)\,W_2(y-p):dx \cr
\frac{\partial L}{\partial x} &= W_1(H-H^2)\,W_2(y-p) \cr
}$$
In some of these steps, I used a colon to denote the trace/Frobenius product
$$A:B = {\rm tr}(A^TB)$$
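As a quick illustration of that product (the example matrices below are arbitrary), it is simply the sum of the element-wise products of the two matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(3, 5))

frobenius = np.trace(A.T @ B)      # A : B = tr(A^T B)
elementwise = np.sum(A * B)        # ... equals the sum of element-wise products

print(np.isclose(frobenius, elementwise))   # True
```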
Casting the final result back into your preferred notation of row vectors, hats, and Hadamard products yields
$$\eqalign{
\frac{\partial L}{\partial x}
&= \Big(\big(({\hat y}-y)W_2^T\big)\circ(h-h\circ h)\Big)W_1^T \cr
}$$
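A finite-difference sanity check of this result in both forms, as a minimal NumPy sketch; the dimensions, weights, and one-hot target are arbitrary assumptions, and the symbols follow the column-vector notation above ($p$ is the label, $y$ the softmax output):

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary shapes and weights in the column-vector convention (pure assumptions)
W1 = rng.normal(size=(5, 3)); b1 = rng.normal(size=(3,))
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=(5,))
p = np.eye(5)[1]                          # true one-hot label
x = rng.normal(size=(5,))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def loss(x):
    h = sigmoid(W1.T @ x + b1)
    y = softmax(W2.T @ h + b2)
    return -p @ np.log(y)                 # L = -p : log(y)

# Quantities needed for the two closed forms
h = sigmoid(W1.T @ x + b1)
y = softmax(W2.T @ h + b2)
H = np.diag(h)

grad_col = W1 @ (H - H @ H) @ W2 @ (y - p)            # W1 (H - H^2) W2 (y - p)
grad_row = (((y - p) @ W2.T) * (h - h * h)) @ W1.T    # ((y_hat - y) W2^T o (h - h o h)) W1^T

# Central finite differences for comparison
eps = 1e-6
grad_num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                     for e in np.eye(5)])

print(np.allclose(grad_col, grad_num, atol=1e-6),
      np.allclose(grad_row, grad_num, atol=1e-6))     # True True
```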