Derivative of the nuclear norm with respect to its argument












The nuclear norm is defined in the following way:

$$\|X\|_* = \mathrm{tr}\left(\sqrt{X^T X}\right)$$

I'm trying to take the derivative of the nuclear norm with respect to its argument,

$$\frac{\partial \|X\|_*}{\partial X}.$$

Note that $\|X\|_*$ is a norm and is convex. I'm using this for a coordinate descent optimization algorithm. Thank you for your help.


















  • Is there a reason you need the derivative? In a convex optimization setting, this function would likely be handled using a semidefinite transformation or perhaps a projection. – Michael Grant, Mar 8 '14 at 2:47

  • I will add an answer describing these alternate approaches. – Michael Grant, Mar 8 '14 at 15:27

  • You can get the proof from the reference: Characterization of the subdifferential of some matrix norms, Linear Algebra Appl., 170 (1992), pp. 33–45. – askuyue, Sep 3 '16 at 8:56
















linear-algebra derivatives norm matrix-calculus nuclear-norm






asked Mar 6 '14 at 0:19 by Alt · edited May 6 '18 at 10:14 by Rodrigo de Azevedo








8 Answers


















As I said in my comment, in a convex optimization setting, one would normally not use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).

Here are two alternate approaches for "handling" the nuclear norm.

Semidefinite programming. We can use the following identity: the nuclear norm inequality $\|X\|_*\leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying
$$\begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0, \quad \mathop{\textrm{Tr}}W_1 + \mathop{\textrm{Tr}}W_2 \leq 2 y$$
Here, $\succeq 0$ should be interpreted to mean that the $2\times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
For instance, given some equality constraints $\mathcal{A}(X)=b$ where $\mathcal{A}$ is a linear operator, you could do this:
$$\begin{array}{ll}
\text{minimize} & \|X\|_* \\
\text{subject to} & \mathcal{A}(X)=b \end{array}
\quad\Longleftrightarrow\quad
\begin{array}{ll}
\text{minimize} & \tfrac{1}{2}\left( \mathop{\textrm{Tr}}W_1 + \mathop{\textrm{Tr}}W_2 \right) \\
\text{subject to} & \begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0 \\ & \mathcal{A}(X)=b \end{array}$$
My software CVX uses this transformation to implement the function norm_nuc, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $m\ll n$ or $n\ll m$, that expense is exacerbated, since the size of the linear matrix inequality is $(m+n)\times (m+n)$.
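To see this identity in action, here is a small NumPy check (my own sketch, not CVX code): taking $W_1 = U\Sigma U^T$ and $W_2 = V\Sigma V^T$ from the SVD $X = U\Sigma V^T$ gives a feasible pair with $\mathop{\textrm{Tr}}W_1 + \mathop{\textrm{Tr}}W_2 = 2\|X\|_*$, and the block matrix is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
nuc = s.sum()                        # nuclear norm of X

# A feasible certificate for the semidefinite characterization
W1 = U @ np.diag(s) @ U.T            # m x m, symmetric PSD
W2 = V @ np.diag(s) @ V.T            # n x n, symmetric PSD
block = np.block([[W1, X], [X.T, W2]])

# block = [U; V] diag(s) [U; V]^T, hence positive semidefinite
print("Tr W1 + Tr W2 =", np.trace(W1) + np.trace(W2))               # equals 2 * nuc
print("2 ||X||_*     =", 2 * nuc)
print("min eigenvalue of block:", np.linalg.eigvalsh(block).min())  # ~0, never below rounding error
```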



Projected/proximal gradients. Consider the following related problems:
$$\begin{array}{ll}
\text{minimize} & \|\mathcal{A}(X)-b\|_2^2 \\
\text{subject to} & \|X\|_*\leq \delta
\end{array} \qquad
\text{minimize} ~~ \|\mathcal{A}(X)-b\|_2^2+\lambda\|X\|_*$$
Both of these problems trace out tradeoff curves: as $\delta$ or $\lambda$ is varied, you generate a tradeoff between $\|\mathcal{A}(X)-b\|$ and $\|X\|_*$. In a very real sense, these problems are equivalent: for a fixed value of $\delta$, there is going to be a corresponding value of $\lambda$ that yields the exact same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.

The first of these problems can be solved using a projected gradient approach. This approach alternates between gradient steps on the smooth objective and projections back onto the feasible set $\|X\|_*\leq \delta$. The projection step requires being able to compute
$$\mathop{\textrm{Proj}}(Y) = \mathop{\textrm{arg\,min}}_{\{X \,\mid\, \|X\|_*\leq\delta\}} \| X - Y \|$$
which can be done at about the cost of a single SVD plus some $O(n)$ operations.
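For reference, here is a minimal NumPy sketch of that projection (my own illustration; the helper names are made up, not TFOCS's proj_nuclear): project the singular values of $Y$ onto $\{\sigma \geq 0,\ \sum_i \sigma_i \leq \delta\}$ and rebuild the matrix.

```python
import numpy as np

def proj_simplex_leq(s, delta):
    """Euclidean projection of a nonnegative vector s onto {v : v >= 0, sum(v) <= delta}."""
    if s.sum() <= delta:
        return s.copy()
    u = np.sort(s)[::-1]                       # sorted in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - delta) / j > 0)[0][-1]
    theta = (css[rho] - delta) / (rho + 1.0)   # shrinkage threshold
    return np.maximum(s - theta, 0.0)

def proj_nuclear_ball(Y, delta):
    """Projection of Y onto {X : ||X||_* <= delta}: project the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(proj_simplex_leq(s, delta)) @ Vt

# Tiny usage example
rng = np.random.default_rng(7)
Y = rng.standard_normal((5, 4))
X = proj_nuclear_ball(Y, delta=2.0)
print("||Y||_*       =", np.linalg.svd(Y, compute_uv=False).sum())
print("||Proj(Y)||_* =", np.linalg.svd(X, compute_uv=False).sum())   # <= 2.0 up to rounding
```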



The second model can be solved using a proximal gradient approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the proximal function
$$\mathop{\textrm{Prox}}(Y) = \mathop{\textrm{arg\,min}}_X \|X\|_* + \tfrac{1}{2}t^{-1}\|X-Y\|^2$$
where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It's actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!
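Here is a minimal sketch of that prox evaluation in NumPy (my own illustration, not the TFOCS prox_nuclear implementation): soft-threshold the singular values of $Y$ by the step size $t$.

```python
import numpy as np

def prox_nuclear(Y, t):
    """Prox of t * ||.||_* : solves argmin_X ||X||_* + (1/(2t)) ||X - Y||_F^2
    by soft-thresholding the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

# Tiny usage example
rng = np.random.default_rng(1)
Y = rng.standard_normal((5, 4))
X = prox_nuclear(Y, t=0.5)
print("singular values of Y       :", np.linalg.svd(Y, compute_uv=False))
print("singular values of prox(Y) :", np.linalg.svd(X, compute_uv=False))  # shifted down by 0.5, floored at 0
```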



I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of prox_nuclear and proj_nuclear for some hints.



























Start with the SVD of $x$:

$$x = U\Sigma V^T$$

Then
$$\|x\|_* = \operatorname{tr}(\sqrt{x^Tx}) = \operatorname{tr}\left(\sqrt{(U\Sigma V^T)^T(U\Sigma V^T)}\right)$$

$$\Rightarrow \|x\|_* = \operatorname{tr}\left(\sqrt{V\Sigma U^T U\Sigma V^T}\right) = \operatorname{tr}\left(\sqrt{V\Sigma^2V^T}\right)$$

By circularity of trace:

$$\Rightarrow \|x\|_* = \operatorname{tr}\left(\sqrt{V^TV\Sigma^2}\right) = \operatorname{tr}\left(\sqrt{\Sigma^2}\right) = \operatorname{tr}(\Sigma),$$

since the elements of $\Sigma$ are non-negative.

Therefore the nuclear norm can also be defined as the sum of the singular values of the input matrix.

Now, note that the nuclear norm is not differentiable at every point of its domain, but you can find a subgradient.

$$\frac{\partial \|x\|_*}{\partial x} = \frac{\partial \operatorname{tr}(\Sigma)}{\partial x} = \frac{\operatorname{tr}(\partial\Sigma)}{\partial x}$$

You should find $\partial\Sigma$. Since $\Sigma$ is diagonal, we can write $\partial\Sigma = \Sigma\Sigma^{-1}\partial\Sigma$, so we have:

$$\frac{\partial \|x\|_*}{\partial x} = \frac{\operatorname{tr}(\Sigma\Sigma^{-1}\partial\Sigma)}{\partial x} \qquad \text{(I)}$$

So we should find $\partial\Sigma$. Since $x = U\Sigma V^T$,
$$\partial x = \partial U\,\Sigma V^T + U\,\partial\Sigma\, V^T + U\Sigma\,\partial V^T$$

Therefore:

$$U\,\partial\Sigma\, V^T = \partial x - \partial U\,\Sigma V^T - U\Sigma\,\partial V^T$$

$$\Rightarrow U^TU\,\partial\Sigma\, V^TV = U^T\partial x\, V - U^T\partial U\,\Sigma V^TV - U^TU\Sigma\,\partial V^TV$$

$$\Rightarrow \partial\Sigma = U^T\partial x\, V - U^T\partial U\,\Sigma - \Sigma\,\partial V^TV$$

You can show that $-U^T\partial U\,\Sigma - \Sigma\,\partial V^TV = 0$ (hint: diagonal and antisymmetric matrices; I skip the proof here), therefore:

$$\partial\Sigma = U^T\partial x\, V$$

By substitution into (I):

$$\frac{\partial \|x\|_*}{\partial x} = \frac{\operatorname{tr}(\Sigma\Sigma^{-1}\partial\Sigma)}{\partial x} = \frac{\operatorname{tr}(U^T\partial x\, V)}{\partial x} = \frac{\operatorname{tr}(VU^T\partial x)}{\partial x} = (VU^T)^T$$

Therefore you can use $U V^T$ as the subgradient.
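As a quick numerical sanity check of this result (a sketch I added, not part of the answer): at a full-rank $x$ the nuclear norm is differentiable, so $UV^T$ should agree with a finite-difference gradient.

```python
import numpy as np

def nuclear_norm(x):
    return np.linalg.svd(x, compute_uv=False).sum()

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 3))          # generically full column rank

# Closed-form (sub)gradient from the answer: U V^T from a thin SVD
U, s, Vt = np.linalg.svd(x, full_matrices=False)
G = U @ Vt

# Finite-difference gradient, entry by entry
eps = 1e-6
G_fd = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        E = np.zeros_like(x)
        E[i, j] = eps
        G_fd[i, j] = (nuclear_norm(x + E) - nuclear_norm(x - E)) / (2 * eps)

print("max |UV^T - finite difference| =", np.abs(G - G_fd).max())   # small, around 1e-8 or below
```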



















  • The diagonal of $\Sigma$ is already nonnegative. No need for the absolute value. – Rodrigo de Azevedo, Dec 21 '16 at 20:27

  • Thanks, I fixed it @RodrigodeAzevedo – Alt, Dec 21 '16 at 23:48

  • Alt, I am trying to understand why taking the nuclear norm is not differentiable. You said it is because of the absolute values, but as @RodrigodeAzevedo pointed out, $\Sigma$ is already non-negative. Given that there is no absolute value, why is it not differentiable? – The_Anomaly, Jan 17 '18 at 19:40

  • @The_Anomaly, the singular values are non-negative, but we are taking the derivative with respect to the matrix $x$. For example, if $x$ is a $1\times 1$ matrix, then the nuclear norm is equivalent to the absolute value of $x$, which is non-differentiable. – Alt, Jan 17 '18 at 20:50



















You can use this nice result for the differential of the trace
$$ d\,\mathrm{tr}(f(A)) = f'(A^T):dA $$
to write
$$\begin{aligned}
d\,\mathrm{tr}\big((x^Tx)^{\frac{1}{2}}\big) &= \tfrac{1}{2} (x^Tx)^{-\frac{1}{2}} : d(x^Tx) \\
&= \tfrac{1}{2} (x^Tx)^{-\frac{1}{2}} : (dx^T\, x + x^T\, dx) \\
&= x(x^Tx)^{-\frac{1}{2}} : dx
\end{aligned}$$
yielding the derivative as
$$ \frac{\partial\|x\|_*}{\partial x} = x(x^Tx)^{-\frac{1}{2}} $$
Another nice result (this one's from Higham)
$$ A\,f(B\,A) = f(A\,B)\,A $$
yields an alternative expression with (potentially) smaller dimensions
$$ \frac{\partial\|x\|_*}{\partial x} = (x\,x^T)^{-\frac{1}{2}}x $$

While the square root of $x^Tx$ certainly exists, the inverse may not. So you might need some sort of regularization, e.g.
$$ \frac{\partial\|x\|_*}{\partial x} = x(x^Tx+\varepsilon I)^{-\frac{1}{2}} $$
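A small NumPy sketch (mine, assuming $x$ has full column rank) that evaluates $x(x^Tx+\varepsilon I)^{-1/2}$ through an eigendecomposition and compares it with the $UV^T$ expression from the SVD-based answers:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 3))          # full column rank almost surely

# Inverse square root of x^T x (+ eps*I regularization) via eigendecomposition
eps = 1e-12
w, Q = np.linalg.eigh(x.T @ x + eps * np.eye(x.shape[1]))
inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

grad = x @ inv_sqrt                      # x (x^T x)^{-1/2}

# Compare with U V^T from the thin SVD
U, s, Vt = np.linalg.svd(x, full_matrices=False)
print("max |x (x^T x)^{-1/2} - U V^T| =", np.abs(grad - U @ Vt).max())   # tiny (rounding error only)
```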















  • I believe the colon notation represents the Frobenius product (en.wikipedia.org/wiki/Matrix_multiplication#Frobenius_product). – lynne, Oct 21 '14 at 0:47



















Of course, $n : x\in M_{n,p} \rightarrow \operatorname{tr}(\sqrt{x^Tx})$ is differentiable at every $x$ such that $x^Tx$ is invertible, that is, in the generic case when $n\geq p$ (if $n\leq p$, then consider $\operatorname{tr}(\sqrt{xx^T})$). The result of greg is correct; yet his proof is unclear, so I rewrite it for convenience.

If $A$ is symmetric $>0$, then $f : A\rightarrow \sqrt{A}$ is a matrix function (cf. Higham's book on the subject); if $g$ is a matrix function and $\phi : A\rightarrow \operatorname{tr}(g(A))$, then its derivative is $D\phi_A : K\rightarrow \operatorname{tr}(g'(A)K)$. Let $A=x^Tx$. Thus $Dn_x : H\rightarrow \operatorname{tr}(f'(A)(H^Tx+x^TH)) = \operatorname{tr}((f'(A)^Tx^T+f'(A)x^T)H)$. Then the gradient of $n$ is $\nabla(n)(x) = x(f'(A)+f'(A)^T) = 2xf'(A) = x(x^Tx)^{-1/2}$.

As Alt did, we can use the SVD and we find $\nabla(n)(x) = U\Sigma (\Sigma^T\Sigma)^{-1/2}V^T$ ($=UV^T$ if $n=p$). Recall (to Alt) that the diagonal of $\Sigma$ is $\geq 0$.



























Alt's answer has a fundamental error. First of all, the nuclear norm is the sum of all singular values, not the sum of the absolute values of the singular values.

To make it right, we need to first define the square root for a matrix by $\sqrt{BB}=B$. As Alt showed,

$$\|x\|_* = \operatorname{tr}\left(\sqrt{V\Sigma^2V^T}\right)$$

But we cannot use the circularity of trace here, because it is not well defined. We should do something like this:

$$\|x\|_* = \operatorname{tr}\left(\sqrt{V\Sigma^2V^T}\right) = \operatorname{tr}\left(\sqrt{V\Sigma V^T V\Sigma V^T}\right) = \operatorname{tr}(V\Sigma V^T),$$

where the last equality is based on the definition of the matrix square root described above. Then, by the circularity of trace, we get

$$\operatorname{tr}(V\Sigma V^T) = \operatorname{tr}(\Sigma V^TV) = \operatorname{tr}(\Sigma) = \sum_i \sigma_i.$$
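For what it's worth, the identity $\operatorname{tr}(\sqrt{x^Tx}) = \sum_i \sigma_i$ is easy to spot-check numerically (a sketch using SciPy's matrix square root; nothing here is from the answer above):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
x = rng.standard_normal((6, 4))

# Left-hand side: trace of the principal matrix square root of x^T x
lhs = np.trace(sqrtm(x.T @ x)).real

# Right-hand side: sum of the singular values of x
rhs = np.linalg.svd(x, compute_uv=False).sum()

print(lhs, rhs)   # the two values agree up to rounding error
```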



















  • Does this mean it can be negative? For real numbers we often define $\sqrt{x^2} := |x|$. What is the equivalent for matrices? – Harsh, Aug 21 '18 at 16:08





















Short answer

The nuclear norm has subgradients (w.r.t. its argument). You may use $UV^\top$ in your algorithm if you need one.

See https://math.stackexchange.com/a/1016743/351390 for where it is actually differentiable. You should also see the comment by loup blanc below.

Details

The elements in $\partial\|X\|_*$ can be characterized as a sum of two parts.
Let $X = U \Sigma V^\top$ be a (skinny) singular value decomposition; then

$$Y \in \partial \|X\|_* \quad \Leftrightarrow \quad Y = UV^\top + W \text{ for some } W \in T^\perp$$

where the definition of the subspace (of matrices) $T$ is a bit complicated:
$$T := \operatorname{span}\left(\{x v_j^\top \mid j \in \{1, \ldots, m\},\ x \text{ is any vector}\} \cup \{u_i y^\top \mid i \in \{1, \ldots, n\},\ y \text{ is any vector}\}\right)$$
However, you don't need to pay too much attention to it, because obviously $0$ is an element of $T^\perp$, therefore at least
$$UV^\top \in \partial\|X\|_*$$

For proof, as commented by askuyue, see https://www.sciencedirect.com/science/article/pii/0024379592904072
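A sketch (my own, not from the cited paper) that spot-checks the subgradient property of $UV^\top$ at a rank-deficient $X$: a subgradient $G$ must satisfy $\|Z\|_* \geq \|X\|_* + \langle G,\ Z - X\rangle_F$ for every $Z$.

```python
import numpy as np

def nuclear_norm(A):
    return np.linalg.svd(A, compute_uv=False).sum()

rng = np.random.default_rng(5)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-deficient point

# Skinny SVD keeping only the r nonzero singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]
G = U @ Vt                                                       # candidate subgradient

# Check the subgradient inequality on many random test points Z
gaps = []
for _ in range(1000):
    Z = rng.standard_normal((m, n))
    gaps.append(nuclear_norm(Z) - nuclear_norm(X) - np.sum(G * (Z - X)))
print("smallest gap over 1000 random Z:", min(gaps))             # >= 0, so the inequality holds
```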



















  • Welcome to MSE. This should be a comment, not an answer. – José Carlos Santos, Apr 8 '18 at 11:06

  • @JoséCarlosSantos Sorry, I was not looking at your comment while I was editing the answer to elaborate. Please advise me whether this still should be moved. – diadochos, Apr 8 '18 at 11:18

  • Now it looks fine. – José Carlos Santos, Apr 8 '18 at 11:37

  • When you write "nuclear norm is not differentiable", you copy what you read elsewhere; unfortunately (for you) it's false, or at least very incomplete. Indeed, the nuclear norm is differentiable in a neighborhood of any $X$ s.t. $X^TX$ is invertible (see my post in this thread). Moreover, the nuclear norm is differentiable on $C^1$ arcs $t\rightarrow X(t)$ s.t. $\operatorname{rank}(X(t))$ is constant (possibly $<n$). – loup blanc, Apr 8 '18 at 18:03

  • I see. I'm sorry I was very thoughtless, and thank you very much for your guidance. I will edit the post. – diadochos, Apr 10 '18 at 6:15



















What about $\| M \|_{F} = \sqrt{\mathrm{Trace}(M^{T}M)}$?



















  • $\sqrt{M^{T}M} = (M^{T}M)^{\frac{1}{2}}$? – askuyue, Nov 11 '14 at 7:32



















The challenge in calculating the gradient of $\|X\|_{*}$ comes from the non-smooth process of computing the singular values of $X$.

Thus, I usually first transform $X$ into a symmetric positive semi-definite matrix $\hat{X}$, and then we have $\|\hat{X}\|_{*} = \operatorname{tr}(\hat{X})$. The intuition is that the eigenvalues are equal to the singular values when $\hat{X}$ is a symmetric positive semi-definite matrix, and $\operatorname{tr}(\hat{X}) = \sum \text{eigenvalues}$.

Finally, we have $\frac{\partial \|\hat{X}\|_{*}}{\partial \hat{X}} = \frac{\partial \operatorname{tr}(\hat{X})}{\partial \hat{X}} = I$.

I am not sure whether it is helpful to you.
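A short numerical check of this claim (a sketch; the matrix below is just an arbitrary PSD example): for a symmetric positive semidefinite matrix the singular values coincide with the eigenvalues, so the nuclear norm equals the trace.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((5, 5))
X_hat = B @ B.T                                       # symmetric positive semidefinite by construction

nuc = np.linalg.svd(X_hat, compute_uv=False).sum()    # nuclear norm = sum of singular values
print(nuc, np.trace(X_hat))                           # equal up to rounding error
```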



















        Your Answer





        StackExchange.ifUsing("editor", function () {
        return StackExchange.using("mathjaxEditing", function () {
        StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
        StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
        });
        });
        }, "mathjax-editing");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "69"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: true,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: 10,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        noCode: true, onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f701062%2fderivative-of-the-nuclear-norm-with-respect-to-its-argument%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        8 Answers
        8






        active

        oldest

        votes








        8 Answers
        8






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        20












        $begingroup$

        As I said in my comment, in a convex optimization setting, one would normally not use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).



        Here are two alternate approaches for "handling" the nuclear norm.



        Semidefinite programming. We can use the following identity: the nuclear norm inequality $|X|_*leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying
        $$begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0, ~ mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 leq 2 y$$
        Here, $succeq 0$ should be interpreted to mean that the $2times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
        For instance, given some equality constraints $mathcal{A}(X)=b$ where $mathcal{A}$ is a linear operator, you could do this:
        $$begin{array}{ll}
        text{minimize} & |X|_* \
        text{subject to} & mathcal{A}(X)=b end{array}
        quadLongleftrightarrowquad
        begin{array}{ll}
        text{minimize} & tfrac{1}{2}left( mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 right) \
        text{subject to} & begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0 \ & mathcal{A}(X)=b end{array}
        $$
        My software CVX uses this transformation to implement the function norm_nuc, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $mll n$ or $nll m$, that expense is exacerbated, since size of the linear matrix inequality is $(m+n)times (m+n)$.



        Projected/proximal gradients. Consider the following related problems:
        $$begin{array}{ll}
        text{minimize} & |mathcal{A}(X)-b|_2^2 \
        text{subject to} & |X|_*leq delta
        end{array} quad
        $$ $$text{minimize} ~~ |mathcal{A}(X)-b|_2^2+lambda|X|_*$$
        Both of these problems trace out tradeoff curves: as $delta$ or $lambda$ is varied, you generate a tradeoff between $|mathcal{A}(X)-b|$ and $|X|_*$. In a very real sense, these problems are equivalent: for a fixed value of $delta$, there is going to be a corresponding value of $lambda$ that yields the exact same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.



        The first of these problems can be solved using a projected gradient approach. This approach alternates between gradient steps on the smooth objective and projections back onto the feasible set $|X|_*leq delta$. The projection step requires being able to compute
        $$mathop{textrm{Proj}}(Y) = mathop{textrm{arg min}}_{{X,|,|X|_*leqdelta}} | X - Y |$$
        which can be done at about the cost of a single SVD plus some $O(n)$ operations.



        The second model can be solved using a proximal gradient approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the proximal function
        $$mathop{textrm{Prox}}(Y) = mathop{textrm{arg min}}_X |X|_* + tfrac{1}{2}t^{-1}|X-Y|^2$$
        where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It's actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!



        I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of prox_nuclear and proj_nuclear for some hints.






        share|cite|improve this answer











        $endgroup$


















          20












          $begingroup$

          As I said in my comment, in a convex optimization setting, one would normally not use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).



          Here are two alternate approaches for "handling" the nuclear norm.



          Semidefinite programming. We can use the following identity: the nuclear norm inequality $|X|_*leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying
          $$begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0, ~ mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 leq 2 y$$
          Here, $succeq 0$ should be interpreted to mean that the $2times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
          For instance, given some equality constraints $mathcal{A}(X)=b$ where $mathcal{A}$ is a linear operator, you could do this:
          $$begin{array}{ll}
          text{minimize} & |X|_* \
          text{subject to} & mathcal{A}(X)=b end{array}
          quadLongleftrightarrowquad
          begin{array}{ll}
          text{minimize} & tfrac{1}{2}left( mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 right) \
          text{subject to} & begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0 \ & mathcal{A}(X)=b end{array}
          $$
          My software CVX uses this transformation to implement the function norm_nuc, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $mll n$ or $nll m$, that expense is exacerbated, since size of the linear matrix inequality is $(m+n)times (m+n)$.



          Projected/proximal gradients. Consider the following related problems:
          $$begin{array}{ll}
          text{minimize} & |mathcal{A}(X)-b|_2^2 \
          text{subject to} & |X|_*leq delta
          end{array} quad
          $$ $$text{minimize} ~~ |mathcal{A}(X)-b|_2^2+lambda|X|_*$$
          Both of these problems trace out tradeoff curves: as $delta$ or $lambda$ is varied, you generate a tradeoff between $|mathcal{A}(X)-b|$ and $|X|_*$. In a very real sense, these problems are equivalent: for a fixed value of $delta$, there is going to be a corresponding value of $lambda$ that yields the exact same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.



          The first of these problems can be solved using a projected gradient approach. This approach alternates between gradient steps on the smooth objective and projections back onto the feasible set $|X|_*leq delta$. The projection step requires being able to compute
          $$mathop{textrm{Proj}}(Y) = mathop{textrm{arg min}}_{{X,|,|X|_*leqdelta}} | X - Y |$$
          which can be done at about the cost of a single SVD plus some $O(n)$ operations.



          The second model can be solved using a proximal gradient approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the proximal function
          $$mathop{textrm{Prox}}(Y) = mathop{textrm{arg min}}_X |X|_* + tfrac{1}{2}t^{-1}|X-Y|^2$$
          where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It's actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!



          I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of prox_nuclear and proj_nuclear for some hints.






          share|cite|improve this answer











          $endgroup$
















            20












            20








            20





            $begingroup$

            As I said in my comment, in a convex optimization setting, one would normally not use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).



            Here are two alternate approaches for "handling" the nuclear norm.



            Semidefinite programming. We can use the following identity: the nuclear norm inequality $|X|_*leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying
            $$begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0, ~ mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 leq 2 y$$
            Here, $succeq 0$ should be interpreted to mean that the $2times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
            For instance, given some equality constraints $mathcal{A}(X)=b$ where $mathcal{A}$ is a linear operator, you could do this:
            $$begin{array}{ll}
            text{minimize} & |X|_* \
            text{subject to} & mathcal{A}(X)=b end{array}
            quadLongleftrightarrowquad
            begin{array}{ll}
            text{minimize} & tfrac{1}{2}left( mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 right) \
            text{subject to} & begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0 \ & mathcal{A}(X)=b end{array}
            $$
            My software CVX uses this transformation to implement the function norm_nuc, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $mll n$ or $nll m$, that expense is exacerbated, since size of the linear matrix inequality is $(m+n)times (m+n)$.



            Projected/proximal gradients. Consider the following related problems:
            $$begin{array}{ll}
            text{minimize} & |mathcal{A}(X)-b|_2^2 \
            text{subject to} & |X|_*leq delta
            end{array} quad
            $$ $$text{minimize} ~~ |mathcal{A}(X)-b|_2^2+lambda|X|_*$$
            Both of these problems trace out tradeoff curves: as $delta$ or $lambda$ is varied, you generate a tradeoff between $|mathcal{A}(X)-b|$ and $|X|_*$. In a very real sense, these problems are equivalent: for a fixed value of $delta$, there is going to be a corresponding value of $lambda$ that yields the exact same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.



            The first of these problems can be solved using a projected gradient approach. This approach alternates between gradient steps on the smooth objective and projections back onto the feasible set $|X|_*leq delta$. The projection step requires being able to compute
            $$mathop{textrm{Proj}}(Y) = mathop{textrm{arg min}}_{{X,|,|X|_*leqdelta}} | X - Y |$$
            which can be done at about the cost of a single SVD plus some $O(n)$ operations.



            The second model can be solved using a proximal gradient approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the proximal function
            $$mathop{textrm{Prox}}(Y) = mathop{textrm{arg min}}_X |X|_* + tfrac{1}{2}t^{-1}|X-Y|^2$$
            where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It's actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!



            I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of prox_nuclear and proj_nuclear for some hints.






            share|cite|improve this answer











            $endgroup$



            As I said in my comment, in a convex optimization setting, one would normally not use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).



            Here are two alternate approaches for "handling" the nuclear norm.



            Semidefinite programming. We can use the following identity: the nuclear norm inequality $|X|_*leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying
            $$begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0, ~ mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 leq 2 y$$
            Here, $succeq 0$ should be interpreted to mean that the $2times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
            For instance, given some equality constraints $mathcal{A}(X)=b$ where $mathcal{A}$ is a linear operator, you could do this:
            $$begin{array}{ll}
            text{minimize} & |X|_* \
            text{subject to} & mathcal{A}(X)=b end{array}
            quadLongleftrightarrowquad
            begin{array}{ll}
            text{minimize} & tfrac{1}{2}left( mathop{textrm{Tr}}W_1 + mathop{textrm{Tr}}W_2 right) \
            text{subject to} & begin{bmatrix} W_1 & X \ X^T & W_2 end{bmatrix} succeq 0 \ & mathcal{A}(X)=b end{array}
            $$
            My software CVX uses this transformation to implement the function norm_nuc, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $mll n$ or $nll m$, that expense is exacerbated, since size of the linear matrix inequality is $(m+n)times (m+n)$.



            Projected/proximal gradients. Consider the following related problems:
            $$begin{array}{ll}
            text{minimize} & |mathcal{A}(X)-b|_2^2 \
            text{subject to} & |X|_*leq delta
            end{array} quad
            $$ $$text{minimize} ~~ |mathcal{A}(X)-b|_2^2+lambda|X|_*$$
            Both of these problems trace out tradeoff curves: as $delta$ or $lambda$ is varied, you generate a tradeoff between $|mathcal{A}(X)-b|$ and $|X|_*$. In a very real sense, these problems are equivalent: for a fixed value of $delta$, there is going to be a corresponding value of $lambda$ that yields the exact same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.



            The first of these problems can be solved using a projected gradient approach. This approach alternates between gradient steps on the smooth objective and projections back onto the feasible set $|X|_*leq delta$. The projection step requires being able to compute
            $$mathop{textrm{Proj}}(Y) = mathop{textrm{arg min}}_{{X,|,|X|_*leqdelta}} | X - Y |$$
            which can be done at about the cost of a single SVD plus some $O(n)$ operations.



            The second model can be solved using a proximal gradient approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the proximal function
            $$mathop{textrm{Prox}}(Y) = mathop{textrm{arg min}}_X |X|_* + tfrac{1}{2}t^{-1}|X-Y|^2$$
            where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It's actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!



            I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of prox_nuclear and proj_nuclear for some hints.







            share|cite|improve this answer














            share|cite|improve this answer



            share|cite|improve this answer








            edited Apr 2 '14 at 1:20

























            answered Mar 8 '14 at 15:58









            Michael GrantMichael Grant

            15k11944




            15k11944























                17












                $begingroup$

                Start with the SVD decomposition of $x$:



                $$x=USigma V^T$$



                Then $$|x|_*=tr(sqrt{x^Tx})=tr(sqrt{(USigma V^T)^T(USigma V^T)})$$



                $$Rightarrow |x|_*=tr(sqrt{VSigma U^T USigma V^T})=tr(sqrt{VSigma^2V^T})$$



                By circularity of trace:



                $$Rightarrow |x|_*=tr(sqrt{V^TVSigma^2})=tr(sqrt{V^TVSigma^2})=tr(sqrt{Sigma^2})=tr(Sigma)$$



                Since the elements of $Sigma$ are non-negative.



                Therefore nuclear norm can be also defined as the sum of the absolute values of the singular value decomposition of the input matrix.



                Now, note that the absolute value function is not differentiable on every point in its domain, but you can find a subgradient.



                $$frac{partial |x|_*}{partial x}=frac{partial tr(Sigma)}{partial x}=frac{ tr(partialSigma)}{partial x}$$



                You should find $partialSigma$. Since $Sigma$ is diagonal, the subdifferential set of $Sigma$ is: $partialSigma=SigmaSigma^{-1}partialSigma$, now we have:



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1}partialSigma)}{partial x}$$ (I)



                So we should find $partialSigma$.



                $x=USigma V^T$, therefore:
                $$partial x=partial USigma V^T+UpartialSigma V^T+USigmapartial V^T$$



                Therefore:



                $$UpartialSigma V^T=partial x-partial USigma V^T-USigmapartial V^T$$



                $$Rightarrow U^TUpartialSigma V^TV=U^Tpartial xV-U^Tpartial USigma V^TV-U^TUSigmapartial V^TV$$



                $$Rightarrow partialSigma =U^Tpartial xV-U^Tpartial USigma - Sigmapartial V^TV$$



                You can show that $-U^Tpartial USigma - Sigmapartial V^TV=0$ (Hint: diagonal and antisymmetric matrices, I skip the proof here.), therefore:



                $$partialSigma =U^Tpartial xV$$



                By substitution into (I):



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1} partialSigma)}{partial x}=frac{ tr(U^Tpartial xV)}{partial x}=frac{ tr(VU^Tpartial x)}{partial x}=(VU^T)^T$$



                Therefore you can use $U V^T$ as the subgradient.






                share|cite|improve this answer











                $endgroup$













                • $begingroup$
                  The diagonal of $Sigma$ is already nonnegative. No need for the absolute value.
                  $endgroup$
                  – Rodrigo de Azevedo
                  Dec 21 '16 at 20:27










                • $begingroup$
                  Thanks, I fixed it @RodrigodeAzevedo
                  $endgroup$
                  – Alt
                  Dec 21 '16 at 23:48












                • $begingroup$
                  Alt, I am trying to understand why taking the nuclear norm is not differentiable. You said it is because of the absolute values, but as @Rodrigo de Avezedo pointed out, the Sigma is already non-negative. Given that there is no absolute value, why is it not differentiable?
                  $endgroup$
                  – The_Anomaly
                  Jan 17 '18 at 19:40






                • 1




                  $begingroup$
                  @The_Anomaly, the singular values are non-negative, but we are taking the derivative with respect to the Matrix $x$. For example, if $x$ is a $1times 1$ matrix, then the nuclear norm is equivalent to the absolute value of $x$, which is non-diferentiable.
                  $endgroup$
                  – Alt
                  Jan 17 '18 at 20:50
















                17












                $begingroup$

                Start with the SVD decomposition of $x$:



                $$x=USigma V^T$$



                Then $$|x|_*=tr(sqrt{x^Tx})=tr(sqrt{(USigma V^T)^T(USigma V^T)})$$



                $$Rightarrow |x|_*=tr(sqrt{VSigma U^T USigma V^T})=tr(sqrt{VSigma^2V^T})$$



                By circularity of trace:



                $$Rightarrow |x|_*=tr(sqrt{V^TVSigma^2})=tr(sqrt{V^TVSigma^2})=tr(sqrt{Sigma^2})=tr(Sigma)$$



                Since the elements of $Sigma$ are non-negative.



                Therefore nuclear norm can be also defined as the sum of the absolute values of the singular value decomposition of the input matrix.



                Now, note that the absolute value function is not differentiable on every point in its domain, but you can find a subgradient.



                $$frac{partial |x|_*}{partial x}=frac{partial tr(Sigma)}{partial x}=frac{ tr(partialSigma)}{partial x}$$



                You should find $partialSigma$. Since $Sigma$ is diagonal, the subdifferential set of $Sigma$ is: $partialSigma=SigmaSigma^{-1}partialSigma$, now we have:



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1}partialSigma)}{partial x}$$ (I)



                So we should find $partialSigma$.



                $x=USigma V^T$, therefore:
                $$partial x=partial USigma V^T+UpartialSigma V^T+USigmapartial V^T$$



                Therefore:



                $$UpartialSigma V^T=partial x-partial USigma V^T-USigmapartial V^T$$



                $$Rightarrow U^TUpartialSigma V^TV=U^Tpartial xV-U^Tpartial USigma V^TV-U^TUSigmapartial V^TV$$



                $$Rightarrow partialSigma =U^Tpartial xV-U^Tpartial USigma - Sigmapartial V^TV$$



                You can show that $-U^Tpartial USigma - Sigmapartial V^TV=0$ (Hint: diagonal and antisymmetric matrices, I skip the proof here.), therefore:



                $$partialSigma =U^Tpartial xV$$



                By substitution into (I):



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1} partialSigma)}{partial x}=frac{ tr(U^Tpartial xV)}{partial x}=frac{ tr(VU^Tpartial x)}{partial x}=(VU^T)^T$$



                Therefore you can use $U V^T$ as the subgradient.






                share|cite|improve this answer











                $endgroup$













                • $begingroup$
                  The diagonal of $Sigma$ is already nonnegative. No need for the absolute value.
                  $endgroup$
                  – Rodrigo de Azevedo
                  Dec 21 '16 at 20:27










                • $begingroup$
                  Thanks, I fixed it @RodrigodeAzevedo
                  $endgroup$
                  – Alt
                  Dec 21 '16 at 23:48












                • $begingroup$
                  Alt, I am trying to understand why taking the nuclear norm is not differentiable. You said it is because of the absolute values, but as @Rodrigo de Avezedo pointed out, the Sigma is already non-negative. Given that there is no absolute value, why is it not differentiable?
                  $endgroup$
                  – The_Anomaly
                  Jan 17 '18 at 19:40






                • 1




                  $begingroup$
                  @The_Anomaly, the singular values are non-negative, but we are taking the derivative with respect to the Matrix $x$. For example, if $x$ is a $1times 1$ matrix, then the nuclear norm is equivalent to the absolute value of $x$, which is non-diferentiable.
                  $endgroup$
                  – Alt
                  Jan 17 '18 at 20:50














                17












                17








                17





                $begingroup$

                Start with the SVD decomposition of $x$:



                $$x=USigma V^T$$



                Then $$|x|_*=tr(sqrt{x^Tx})=tr(sqrt{(USigma V^T)^T(USigma V^T)})$$



                $$Rightarrow |x|_*=tr(sqrt{VSigma U^T USigma V^T})=tr(sqrt{VSigma^2V^T})$$



                By circularity of trace:



                $$Rightarrow |x|_*=tr(sqrt{V^TVSigma^2})=tr(sqrt{V^TVSigma^2})=tr(sqrt{Sigma^2})=tr(Sigma)$$



                Since the elements of $Sigma$ are non-negative.



                Therefore nuclear norm can be also defined as the sum of the absolute values of the singular value decomposition of the input matrix.



                Now, note that the absolute value function is not differentiable on every point in its domain, but you can find a subgradient.



                $$frac{partial |x|_*}{partial x}=frac{partial tr(Sigma)}{partial x}=frac{ tr(partialSigma)}{partial x}$$



                You should find $partialSigma$. Since $Sigma$ is diagonal, the subdifferential set of $Sigma$ is: $partialSigma=SigmaSigma^{-1}partialSigma$, now we have:



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1}partialSigma)}{partial x}$$ (I)



                So we should find $partialSigma$.



                $x=USigma V^T$, therefore:
                $$partial x=partial USigma V^T+UpartialSigma V^T+USigmapartial V^T$$



                Therefore:



                $$UpartialSigma V^T=partial x-partial USigma V^T-USigmapartial V^T$$



                $$Rightarrow U^TUpartialSigma V^TV=U^Tpartial xV-U^Tpartial USigma V^TV-U^TUSigmapartial V^TV$$



                $$Rightarrow partialSigma =U^Tpartial xV-U^Tpartial USigma - Sigmapartial V^TV$$



                You can show that $-U^Tpartial USigma - Sigmapartial V^TV=0$ (Hint: diagonal and antisymmetric matrices, I skip the proof here.), therefore:



                $$partialSigma =U^Tpartial xV$$



                By substitution into (I):



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1} partialSigma)}{partial x}=frac{ tr(U^Tpartial xV)}{partial x}=frac{ tr(VU^Tpartial x)}{partial x}=(VU^T)^T$$



                Therefore you can use $U V^T$ as the subgradient.






                share|cite|improve this answer











                $endgroup$



                Start with the SVD decomposition of $x$:



                $$x=USigma V^T$$



                Then $$|x|_*=tr(sqrt{x^Tx})=tr(sqrt{(USigma V^T)^T(USigma V^T)})$$



                $$Rightarrow |x|_*=tr(sqrt{VSigma U^T USigma V^T})=tr(sqrt{VSigma^2V^T})$$



                By circularity of trace:



                $$Rightarrow |x|_*=tr(sqrt{V^TVSigma^2})=tr(sqrt{V^TVSigma^2})=tr(sqrt{Sigma^2})=tr(Sigma)$$



                Since the elements of $Sigma$ are non-negative.



                Therefore nuclear norm can be also defined as the sum of the absolute values of the singular value decomposition of the input matrix.



                Now, note that the absolute value function is not differentiable on every point in its domain, but you can find a subgradient.



                $$frac{partial |x|_*}{partial x}=frac{partial tr(Sigma)}{partial x}=frac{ tr(partialSigma)}{partial x}$$



                You should find $partialSigma$. Since $Sigma$ is diagonal, the subdifferential set of $Sigma$ is: $partialSigma=SigmaSigma^{-1}partialSigma$, now we have:



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1}partialSigma)}{partial x}$$ (I)



                So we should find $partialSigma$.



                $x=USigma V^T$, therefore:
                $$partial x=partial USigma V^T+UpartialSigma V^T+USigmapartial V^T$$



                Therefore:



                $$UpartialSigma V^T=partial x-partial USigma V^T-USigmapartial V^T$$



                $$Rightarrow U^TUpartialSigma V^TV=U^Tpartial xV-U^Tpartial USigma V^TV-U^TUSigmapartial V^TV$$



                $$Rightarrow partialSigma =U^Tpartial xV-U^Tpartial USigma - Sigmapartial V^TV$$



                You can show that $-U^Tpartial USigma - Sigmapartial V^TV=0$ (Hint: diagonal and antisymmetric matrices, I skip the proof here.), therefore:



                $$partialSigma =U^Tpartial xV$$



                By substitution into (I):



                $$frac{partial |x|_*}{partial x}=frac{ tr(SigmaSigma^{-1} partialSigma)}{partial x}=frac{ tr(U^Tpartial xV)}{partial x}=frac{ tr(VU^Tpartial x)}{partial x}=(VU^T)^T$$



                Therefore you can use $U V^T$ as the subgradient.







                share|cite|improve this answer














                share|cite|improve this answer



                share|cite|improve this answer








                edited Dec 21 '16 at 23:53

























                answered Mar 6 '14 at 0:55









                AltAlt

                1,71931234




                1,71931234












                • $begingroup$
                  The diagonal of $Sigma$ is already nonnegative. No need for the absolute value.
                  $endgroup$
                  – Rodrigo de Azevedo
                  Dec 21 '16 at 20:27










                • $begingroup$
                  Thanks, I fixed it @RodrigodeAzevedo
                  $endgroup$
                  – Alt
                  Dec 21 '16 at 23:48












                • $begingroup$
                  Alt, I am trying to understand why taking the nuclear norm is not differentiable. You said it is because of the absolute values, but as @Rodrigo de Avezedo pointed out, the Sigma is already non-negative. Given that there is no absolute value, why is it not differentiable?
                  $endgroup$
                  – The_Anomaly
                  Jan 17 '18 at 19:40






                • 1




                  $begingroup$
                  @The_Anomaly, the singular values are non-negative, but we are taking the derivative with respect to the Matrix $x$. For example, if $x$ is a $1times 1$ matrix, then the nuclear norm is equivalent to the absolute value of $x$, which is non-diferentiable.
                  $endgroup$
                  – Alt
                  Jan 17 '18 at 20:50





























You can use this nice result for the differential of the trace
$$ d\,\mathrm{tr}(f(A)) = f'(A^T):dA $$
to write
$$ \eqalign{
d\,\mathrm{tr}\big((x^Tx)^{\frac{1}{2}}\big) &= \frac{1}{2} (x^Tx)^{-\frac{1}{2}}:d(x^Tx) \cr
&= \frac{1}{2} (x^Tx)^{-\frac{1}{2}}:(dx^T\,x + x^T\,dx) \cr
&= x(x^Tx)^{-\frac{1}{2}}:dx \cr
} $$
yielding the derivative
$$ \frac{\partial\|x\|_*}{\partial x} = x(x^Tx)^{-\frac{1}{2}} $$
Another nice result (this one is from Higham),
$$ A\,f(BA) = f(AB)\,A, $$
yields an alternative expression with (potentially) smaller dimensions
$$ \frac{\partial\|x\|_*}{\partial x} = (xx^T)^{-\frac{1}{2}}\,x $$
While the square root of $x^Tx$ certainly exists, the inverse may not, so you might need some sort of regularization, e.g.
$$ \frac{\partial\|x\|_*}{\partial x} = x(x^Tx+\varepsilon I)^{-\frac{1}{2}} $$

answered Sep 17 '14 at 1:19 – greg

I believe the colon notation represents the Frobenius product [en.wikipedia.org/wiki/Matrix_multiplication#Frobenius_product].
– lynne, Oct 21 '14 at 0:47
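As a quick numerical sanity check of greg's formula (a minimal sketch, assuming NumPy; the random $5\times 3$ test matrix is hypothetical): at a generic full-column-rank $x$ the closed form should agree with a finite-difference approximation of the nuclear norm, and the $\varepsilon$-regularized variant should stay close to it.

    import numpy as np

    rng = np.random.default_rng(0)

    def nuclear_norm(X):
        return np.linalg.svd(X, compute_uv=False).sum()

    def grad_nuclear(X, eps=0.0):
        # x (x^T x + eps I)^{-1/2}, via an eigendecomposition of the symmetric matrix x^T x
        G = X.T @ X + eps * np.eye(X.shape[1])
        w, V = np.linalg.eigh(G)
        inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        return X @ inv_sqrt

    X = rng.standard_normal((5, 3))      # generic, so X^T X is invertible
    G = grad_nuclear(X)

    # Compare with central finite differences, entry by entry
    h = 1e-6
    G_fd = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = h
            G_fd[i, j] = (nuclear_norm(X + E) - nuclear_norm(X - E)) / (2 * h)

    print(np.max(np.abs(G - G_fd)))                         # small: the formula matches
    print(np.max(np.abs(G - grad_nuclear(X, eps=1e-12))))   # regularized version is close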
















Of course, $n: x\in M_{n,p}\rightarrow \mathrm{tr}(\sqrt{x^Tx})$ is differentiable at every $x$ such that $x^Tx$ is invertible, that is, in the generic case when $n\geq p$ (if $n\leq p$, then consider $\mathrm{tr}(\sqrt{xx^T})$). The result of greg is correct; yet his proof is unclear, so I rewrite it for convenience.

If $A$ is symmetric $>0$, then $f:A\rightarrow \sqrt{A}$ is a matrix function (cf. Higham's book on this subject); if $g$ is a matrix function and $\phi:A\rightarrow \mathrm{tr}(g(A))$, then its derivative is $D\phi_A:K\rightarrow \mathrm{tr}(g'(A)K)$. Let $A=x^Tx$. Thus $Dn_x:H\rightarrow \mathrm{tr}(f'(A)(H^Tx+x^TH))=\mathrm{tr}((f'(A)^Tx^T+f'(A)x^T)H)$. Then the gradient of $n$ is $\nabla(n)(x)=x(f'(A)+f'(A)^T)=2xf'(A)=x(x^Tx)^{-1/2}$.

As Alt did, we can use the SVD and we find $\nabla(n)(x)=U\Sigma (\Sigma^T\Sigma)^{-1/2}V^T$ ($=UV^T$ if $n=p$). A reminder to Alt that the diagonal of $\Sigma$ is $\geq 0$.

edited Nov 11 '14 at 15:36, answered Nov 11 '14 at 13:16 – loup blanc
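For what it's worth, the SVD expression above can be checked against the closed form numerically; a minimal sketch, assuming NumPy and a hypothetical random test matrix with $n \geq p$:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((6, 4))                    # n >= p, generically full column rank

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # skinny SVD: U is n x p, s has p entries

    grad_svd    = U @ np.diag(s / np.sqrt(s**2)) @ Vt  # U Sigma (Sigma^T Sigma)^{-1/2} V^T
    grad_skinny = U @ Vt                               # = U V^T for the skinny factors

    w, V = np.linalg.eigh(X.T @ X)
    grad_closed = X @ (V @ np.diag(w**-0.5) @ V.T)     # x (x^T x)^{-1/2}

    print(np.max(np.abs(grad_svd - grad_skinny)))      # ~1e-16
    print(np.max(np.abs(grad_svd - grad_closed)))      # ~1e-15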






Alt's answer has a fundamental error: the nuclear norm is the sum of the singular values themselves, not of their absolute values.

To make it right, we first need to define the matrix square root by $\sqrt{BB}=B$. As Alt showed,

$$\|x\|_*=\mathrm{tr}\left(\sqrt{V\Sigma^2V^T}\right),$$

but we cannot apply the circularity of the trace inside the square root, because it is not well defined there. We should instead write

$$\|x\|_*=\mathrm{tr}\left(\sqrt{V\Sigma^2V^T}\right)=\mathrm{tr}\left(\sqrt{V\Sigma V^TV\Sigma V^T}\right)=\mathrm{tr}\left(V\Sigma V^T\right),$$

where the last equality follows from the definition of the matrix square root above. Then, by the circularity of the trace,

$$\mathrm{tr}(V\Sigma V^T)=\mathrm{tr}(\Sigma V^TV)=\mathrm{tr}(\Sigma)=\sum_i \sigma_i.$$

answered Feb 19 '16 at 15:04 – E.J.

Does this mean it can be negative? For real numbers we often define $\sqrt{x^2}:=|x|$. What is the equivalent for matrices?
– Harsh, Aug 21 '18 at 16:08
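The chain of equalities can also be verified numerically; a minimal sketch assuming NumPy and SciPy (scipy.linalg.sqrtm computes the matrix square root; the random test matrix is hypothetical):

    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(2)
    X = rng.standard_normal((5, 3))

    sing_vals = np.linalg.svd(X, compute_uv=False)

    # tr(sqrt(X^T X)) -- the definition from the question
    trace_form = np.trace(sqrtm(X.T @ X)).real

    print(sing_vals.sum(), trace_form)   # both give the nuclear norm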


















Short answer

The nuclear norm has subgradients (w.r.t. its argument). You may use $UV^\top$ in your algorithm if you need one.

See https://math.stackexchange.com/a/1016743/351390 for where it is actually differentiable. You should also see the comment by loup blanc below.

Details

The elements of $\partial\|X\|_*$ can be characterized as a sum of two parts. Let $X = U \Sigma V^\top$ be a (skinny) singular value decomposition; then

$$Y \in \partial \|X\|_* \quad \Leftrightarrow \quad Y = UV^\top + W \ \text{ for some } W \in T^\perp \text{ with } \|W\|_2 \le 1,$$

where the definition of the subspace (of matrices) $T$ is a bit complicated:
$$T := \mathrm{span}\left(\{x v_j^\top \mid j \in \{1, \ldots, m\},\ x \text{ is any vector}\} \cup \{u_i y^\top \mid i \in \{1, \ldots, n\},\ y \text{ is any vector}\}\right).$$
However, you don't need to pay too much attention to it, because $W=0$ clearly satisfies both conditions, therefore at least
$$UV^\top \in \partial\|X\|_*.$$

For the proof, as commented by askuyue, see https://www.sciencedirect.com/science/article/pii/0024379592904072

edited Apr 10 '18 at 6:46, answered Apr 8 '18 at 11:03 – diadochos

Welcome to MSE. This should be a comment, not an answer.
– José Carlos Santos, Apr 8 '18 at 11:06

@JoséCarlosSantos Sorry, I was not looking at your comment while I was editing the answer to elaborate. Please advise me whether this still should be moved.
– diadochos, Apr 8 '18 at 11:18

Now it looks fine.
– José Carlos Santos, Apr 8 '18 at 11:37

When you write "nuclear norm is not differentiable", you copy what you read elsewhere; unfortunately (for you) it's false, or at least very incomplete. Indeed, the nuclear norm is differentiable in a neighborhood of any $X$ s.t. $X^TX$ is invertible (see my post in this thread). Moreover, the nuclear norm is differentiable along $C^1$ arcs $t\rightarrow X(t)$ s.t. $\mathrm{rank}(X(t))$ is constant (possibly $<n$).
– loup blanc, Apr 8 '18 at 18:03

I see. I'm sorry I was very thoughtless, and thank you very much for your guidance. I will edit the post.
– diadochos, Apr 10 '18 at 6:15
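As a minimal sketch of how $UV^\top$ might be used as a subgradient in practice (NumPy is assumed; the random test matrices are hypothetical), one can spot-check the subgradient inequality $\|Y\|_* \ge \|X\|_* + \langle UV^\top, Y-X\rangle$ on random $Y$:

    import numpy as np

    rng = np.random.default_rng(3)

    def nuclear_norm(X):
        return np.linalg.svd(X, compute_uv=False).sum()

    def nuclear_subgradient(X):
        # G = U V^T from a skinny SVD is always an element of the subdifferential
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ Vt

    X = rng.standard_normal((4, 6))
    G = nuclear_subgradient(X)

    # Subgradient inequality: ||Y||_* >= ||X||_* + <G, Y - X> for every Y
    ok = all(
        nuclear_norm(Y) >= nuclear_norm(X) + np.sum(G * (Y - X)) - 1e-9
        for Y in (rng.standard_normal((4, 6)) for _ in range(100))
    )
    print(ok)   # True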
















What about $\|M\|_{F} = \sqrt{\mathrm{Trace}(M^{T}M)}$?

answered Nov 11 '14 at 7:28 – askuyue (edited Nov 11 '14 at 9:54 by Davide Giraudo)

$\sqrt{M^{T}M} = (M^{T}M)^{\frac{1}{2}}$?
– askuyue, Nov 11 '14 at 7:32
















The challenge in calculating the gradient of $\|X\|_{*}$ comes from the non-smooth process of computing the singular values of $X$.

Thus, I usually first transform $X$ into a symmetric positive semi-definite matrix $\hat{X}$, and then we have $\|\hat{X}\|_{*}=\mathrm{tr}(\hat{X})$. The intuition is that the eigenvalues equal the singular values when $\hat{X}$ is symmetric positive semi-definite, and $\mathrm{tr}(\hat{X})=\sum_i \lambda_i$.

Finally, we have $\frac{\partial \|\hat{X}\|_{*}}{\partial \hat{X}}=\frac{\partial\, \mathrm{tr}(\hat{X})}{\partial \hat{X}}=I$.

I am not sure whether this is helpful to you.

answered Jan 8 at 13:51 – 闵少波
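A minimal sketch of the special case described in this answer (NumPy is assumed; forming $\hat{X}=A^TA$ is just one hypothetical way to obtain a symmetric positive semi-definite matrix): for such $\hat{X}$ the nuclear norm is simply the trace, which is linear with gradient $I$.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 5))
    X_hat = A.T @ A                      # symmetric positive semi-definite by construction

    # For a PSD matrix the singular values are the eigenvalues, so the
    # nuclear norm reduces to the trace (a linear function with gradient I).
    sing_vals = np.linalg.svd(X_hat, compute_uv=False)
    print(sing_vals.sum(), np.trace(X_hat))   # the two agree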





