Matrix derivatives, problem with dimensions
I'm trying to find the derivative of the function
$$L = f \cdot y, \qquad f = X \cdot W + b$$

Matrix shapes: $X.shape=(1, m)$, $W.shape=(m, 10)$, $b.shape=(1, 10)$, $y.shape=(10, 1)$.
I'm looking for $\frac{\partial L}{\partial W}$.

According to the chain rule:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial W}$$

Separately we can find
$$\frac{\partial L}{\partial f} = y, \qquad \frac{\partial f}{\partial W} = X.$$

The problem is that, according to this formula, the dimension of $\frac{\partial L}{\partial W}$ is $(10, m)$, whereas it should coincide with the dimension of $W$.

I was also advised to find the differential of $L$:
$$dL = d(f \cdot y) = df \cdot y = d(X \cdot W + b)\,y = X \cdot dW \cdot y,$$
but I do not understand how to get the derivative $\frac{\partial L}{\partial W}$ from this.
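To make the shape question concrete, here is a small numpy sketch (an editorial addition, not part of the original post; the concrete size m = 5 and all variable names are mine). It estimates $\frac{\partial L}{\partial W}$ entry by entry with central finite differences and compares it with the outer product $X^T y^T$, which has the shape of $W$ and is the result derived in the answer below.

```python
import numpy as np

m = 5                                   # arbitrary choice for this sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((1, m))
W = rng.standard_normal((m, 10))
b = rng.standard_normal((1, 10))
y = rng.standard_normal((10, 1))

def L(W):
    f = X @ W + b                       # shape (1, 10)
    return (f @ y).item()               # scalar

eps = 1e-6
grad_fd = np.zeros_like(W)              # finite-difference estimate of dL/dW
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_fd[i, j] = (L(Wp) - L(Wm)) / (2 * eps)

grad_candidate = X.T @ y.T              # shape (m, 10), same shape as W
print(grad_fd.shape)                    # (5, 10)
print(np.allclose(grad_fd, grad_candidate))   # True
```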










Tags: matrices, derivatives, chain-rule






asked Jan 13 at 18:48 by Dmitry Denisov
          1 Answer
Let's use a convention where a lowercase Latin letter always represents a column vector, an uppercase Latin letter a matrix, and a Greek letter a scalar.

Using this convention your equations are
$$\begin{aligned}
f &= W^T x + b \\
\lambda &= f^T y
\end{aligned}$$

As you have noted, the differential of the scalar function is
$$d\lambda = df^T\,y = (dW^T\,x)^T\,y = x^T\,dW\,y$$

Let's develop that a bit further by introducing the trace function:
$$d\lambda = {\rm Tr}(x^T\,dW\,y) = {\rm Tr}(y\,x^T\,dW)$$

Then, depending on your preferred layout convention, the gradient is either
$$\frac{\partial\lambda}{\partial W} = y\,x^T \quad{\rm or}\quad x\,y^T$$

Since you expected the dimensions of the gradient to be those of $W$, it sounds like your preferred layout is $x\,y^T$.

Also note that $\frac{\partial f}{\partial W}\neq X$. That gradient is a 3rd-order tensor, while $X$ is just a 2nd-order tensor (a.k.a. a matrix). The presence of these 3rd- and 4th-order tensors as intermediate quantities in the chain rule can make it difficult or impossible to use in practice.

The differential approach suggested by your advisor is often simpler, because the differential of a matrix is just another matrix quantity, which is easy to handle.
answered Jan 14 at 19:41 (edited Jan 14 at 19:53) by greg
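The trace step above can be checked numerically. The following sketch (my addition, not part of the answer; the sizes m = 5, n = 10 and all names are arbitrary) perturbs $W$ by a small $dW$ and verifies that $d\lambda = \mathrm{Tr}(y\,x^T\,dW) = \langle x\,y^T,\, dW\rangle$, which is exactly how the gradient $x\,y^T$ is read off the differential.

```python
import numpy as np

# Uses the answer's convention: lowercase letters are column vectors,
# W is a matrix, lambda is a scalar.
m, n = 5, 10
rng = np.random.default_rng(1)
x = rng.standard_normal((m, 1))
W = rng.standard_normal((m, n))
b = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))

def lam(W):
    f = W.T @ x + b          # (n, 1)
    return (f.T @ y).item()  # scalar

dW = 1e-6 * rng.standard_normal((m, n))   # small perturbation of W

# d(lambda) = Tr(y x^T dW) = <x y^T, dW>, so the gradient in the
# "same shape as W" layout is the outer product x y^T.
d_lambda = lam(W + dW) - lam(W)
via_trace = np.trace(y @ x.T @ dW)
via_grad = np.sum((x @ y.T) * dW)

print(d_lambda, via_trace, via_grad)   # all three agree (lambda is linear in W)
```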













• Thank you very much for your answer, it's much clearer for me now! I have two questions about your solution: 1) Do I understand correctly that you introduced the trace function because $d\lambda$ is a scalar, so $\mathrm{const}=\mathrm{Tr}(\mathrm{const})$? 2) $\frac{\partial f}{\partial W}$ is a 3rd-order tensor. Do you know how the chain rule works in neural networks in this case? I get a derivative from the previous layer and, according to the chain rule, should multiply it by the derivative of the current layer; however, as you mentioned, $\frac{\partial f}{\partial W}$ is now a 3rd-order tensor, so how can we apply the chain rule?
  – Dmitry Denisov, Jan 14 at 22:13
• And even if $X$ is a matrix, then $f$ is also a matrix, and we would have to take a matrix-by-matrix derivative?
  – Dmitry Denisov, Jan 15 at 12:14
• @DmitryDenisov 1) Yes, $\mathrm{Tr}(\text{scalar})=\text{scalar}$. 2) The gradient really is a 3rd-order tensor. The point is that you never need to calculate 3rd-order (vector-by-matrix) or 4th-order (matrix-by-matrix) derivatives, and the programs you write will never calculate such quantities either. These online notes are worth a read.
  – greg, Jan 15 at 18:04
• In this line, nabla_w[-1] = np.dot(delta, activations[-2].transpose()), they set $\frac{\partial L}{\partial W}$ equal to $X^T \cdot \delta$, so it doesn't look like the chain rule. That is, in other cases they would also have to derive the gradient via the differential on paper, yet they stated that the chain rule is a universal approach.
  – Dmitry Denisov, Jan 16 at 9:57
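To connect the last comment with the answer: in a backprop implementation the chain rule is applied at the level of differentials, so an affine layer only needs the upstream gradient $\delta = \partial L/\partial f$ and its own input $x$; the 3rd-order tensor $\partial f/\partial W$ is never materialized. A hypothetical single-layer sketch (my own, with made-up names, mirroring the nabla_w line quoted in the comment):

```python
import numpy as np

# Hypothetical affine layer f = W^T x + b inside a larger network.
# Backprop hands the layer delta = dL/df from the layers above; the layer turns
# it into dL/dW = x delta^T (an outer product, shaped like W) and sends
# dL/dx = W delta to the previous layer. The 3rd-order tensor df/dW never appears.
def affine_backward(x, W, delta):
    grad_W = x @ delta.T   # plays the role of nabla_w = np.dot(delta, activations.T), up to layout
    grad_b = delta         # dL/db
    grad_x = W @ delta     # dL/dx, i.e. the delta passed down to the previous layer
    return grad_W, grad_b, grad_x

m, n = 4, 3
rng = np.random.default_rng(2)
x = rng.standard_normal((m, 1))
W = rng.standard_normal((m, n))
delta = rng.standard_normal((n, 1))        # pretend this came from the layers above

grad_W, grad_b, grad_x = affine_backward(x, W, delta)
print(grad_W.shape == W.shape, grad_x.shape == x.shape)   # True True
```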












