Problem: State the form of the singular value decomposition (SVD) of an arbitrary linear operator \(X:\mathbf C^m\to\mathbf C^n\).
Solution: The SVD of \(X\) is given by:
\[X=U_2\Sigma U^{\dagger}_1\]
where \(U_1\in U(m)\) and \(U_2\in U(n)\) are unitary operators and \(\Sigma:\mathbf C^m\to\mathbf C^n\) is a (generally rectangular) diagonal operator whose diagonal entries, the singular values \(\sigma_i\geq 0\) of \(X\), are real and non-negative. Strictly speaking, one should prove that such a “rotate-stretch-rotate” factorization really does exist for any \(X\); the details are omitted here.
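As a sanity check, here is a minimal sketch (not part of the original text) that verifies this factorization numerically with NumPy; the routine np.linalg.svd returns the factors in the convention \(X=U\Sigma V^{\dagger}\), where \(U\) plays the role of \(U_2\) and \(V^{\dagger}\) that of \(U_1^{\dagger}\). The dimensions and variable names below are arbitrary choices for illustration.

```python
import numpy as np

# Sketch: verify the rotate-stretch-rotate factorization X = U_2 Sigma U_1^dagger.
rng = np.random.default_rng(0)
n, m = 4, 6
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

# np.linalg.svd returns U (n x n), the singular values s, and Vh (m x m),
# with X = U @ Sigma @ Vh; here U corresponds to U_2 and Vh to U_1^dagger.
U2, s, U1h = np.linalg.svd(X)
Sigma = np.zeros((n, m))
Sigma[:len(s), :len(s)] = np.diag(s)               # rectangular diagonal operator

assert np.allclose(X, U2 @ Sigma @ U1h)            # reconstruction
assert np.allclose(U2.conj().T @ U2, np.eye(n))    # U_2 is unitary
assert np.allclose(U1h @ U1h.conj().T, np.eye(m))  # U_1 is unitary
```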
Problem: In practice, how does one find the SVD of a linear operator \(X\)?
Solution: The idea is to look at the \(2\) Hermitian, positive semi-definite operators constructed from \(X\):
\[XX^{\dagger}=U_2\Sigma\Sigma^{\dagger}U_2^{\dagger}\]
\[X^{\dagger}X=U_1\Sigma^{\dagger}\Sigma U_1^{\dagger}\]
In particular, by the spectral theorem, both \(XX^{\dagger}\) and \(X^{\dagger}X\) have an orthonormal eigenbasis (of \(\mathbf C^n\) and \(\mathbf C^m\) respectively) with real, non-negative eigenvalues. The two operators share the same non-zero eigenvalues, which may be ordered \(\sigma^2_1\geq\sigma^2_2\geq…\geq\sigma^2_{\text{min}(n,m)}\geq 0\) (any remaining eigenvalues of the larger operator are zero). Since the factorizations above match the standard eigendecomposition template, computing the SVD of \(X\) amounts to diagonalizing \(XX^{\dagger}\) and \(X^{\dagger}X\). In fact only one of the two needs to be diagonalized, namely the smaller one: diagonalize \(XX^{\dagger}\) if \(n\leq m\) and \(X^{\dagger}X\) if \(m\leq n\). The squared singular values \(\sigma^2_1,\sigma^2_2,…,\sigma^2_{\text{min}(n,m)}\) are identical for both, and the eigenvectors of the other operator follow immediately: for example, if \(\mathbf v\) is a unit eigenvector of \(XX^{\dagger}\) with eigenvalue \(\sigma^2\neq 0\), then \(X^{\dagger}\mathbf v/\sigma\) is a corresponding unit eigenvector of \(X^{\dagger}X\).
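To make the recipe concrete, here is a small sketch (not part of the original text) that carries it out with NumPy for the case \(n\leq m\): diagonalize the smaller operator \(XX^{\dagger}\) with np.linalg.eigh and recover the eigenvectors of \(X^{\dagger}X\) by applying \(X^{\dagger}\) and normalizing. It assumes \(X\) has full rank \(n\) so that all singular values are non-zero.

```python
import numpy as np

# Sketch (assumes n <= m and full rank n): compute the SVD by diagonalizing
# the smaller Gram operator X X^dagger.
rng = np.random.default_rng(1)
n, m = 3, 5
X = rng.standard_normal((n, m))

# Spectral decomposition of the Hermitian, positive semi-definite X X^dagger.
evals, U2 = np.linalg.eigh(X @ X.conj().T)       # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]                  # reorder from greatest to least
sigma = np.sqrt(np.clip(evals[order], 0, None))  # singular values sigma_i
U2 = U2[:, order]

# Eigenvectors of X^dagger X follow from u_1^(i) = X^dagger u_2^(i) / sigma_i.
U1 = X.conj().T @ U2 / sigma

# Reassemble X = U_2 Sigma U_1^dagger (reduced form: n columns on each side).
assert np.allclose(X, U2 @ np.diag(sigma) @ U1.conj().T)
```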
Problem: The most important application of SVD is to estimate a complicated linear operator \(X:\mathbf C^m\to\mathbf C^n\) by a low-rank approximation. Explain how this is achieved.
Solution: Order the \(\text{min}(n,m)\) singular values of \(X:\mathbf C^m\to\mathbf C^n\) from greatest to least as before, \(\sigma_1\geq\sigma_2\geq…\geq\sigma_{\text{min}(n,m)}\). Write the columns (eigenvectors) of the unitaries as \(U_1=(\mathbf u_1^{(1)}, \mathbf u_1^{(2)},…,\mathbf u_1^{(m)})\) and \(U_2=(\mathbf u_2^{(1)}, \mathbf u_2^{(2)},…,\mathbf u_2^{(n)})\). Then the singular value decomposition may also be written:
\[X=U_2\Sigma U^{\dagger}_1=\sum_{i=1}^{\text{min}(n,m)}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]
(where \(\mathbf u\otimes\mathbf v\) denotes the rank-\(1\) outer product \(\mathbf u\mathbf v^{\dagger}\))
In particular, the Eckart-Young theorem states that the best rank-\(r\) approximation to \(X\) (for \(r\leq\text{rank}(X)\)) is given by truncating the above series after the first \(r\) terms:
\[X\approx \sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]
Here, the word “best” is with respect to the Frobenius norm:
\[\text{argmin}_{\tilde X\in\mathbf C^{n\times m}:\text{rank}(\tilde X)=r}\sqrt{\text{Tr}((X-\tilde X)^{\dagger}(X-\tilde X))}=\sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]
(The proof is omitted here.)
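The following sketch (not part of the original text, using NumPy) illustrates the theorem numerically: truncating the SVD after \(r\) terms leaves a Frobenius error of exactly \(\sqrt{\sum_{i>r}\sigma_i^2}\). The dimensions and the choice of \(r\) are arbitrary.

```python
import numpy as np

# Sketch: rank-r truncation of the SVD and its Frobenius-norm error.
rng = np.random.default_rng(2)
n, m, r = 5, 8, 2
X = rng.standard_normal((n, m))

U2, s, U1h = np.linalg.svd(X, full_matrices=False)
X_r = U2[:, :r] @ np.diag(s[:r]) @ U1h[:r, :]      # sum of the first r outer products

err = np.linalg.norm(X - X_r, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[r:]**2)))  # error = discarded singular values
```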
Problem: Here is another application of SVD: given a diagonal operator \(\Sigma:\mathbf C^m\to\mathbf C^n\), explain how to compute its (Moore-Penrose) pseudoinverse \(\Sigma^+:\mathbf C^n\to\mathbf C^m\). Hence, how can the pseudoinverse of a general linear operator \(X:\mathbf C^m\to\mathbf C^n\) be computed?
Solution: \(\Sigma^+\) is almost like an ordinary inverse \(\Sigma^{-1}\): one transposes \(\Sigma\) and inverts each entry on the diagonal. The exception is when \(\Sigma\) has zero diagonal entries, in which case \(\Sigma^{-1}\) does not exist; the pseudoinverse handles this by simply leaving the zeros alone (rather than trying to invert them and getting \(1/0=\infty\)). For a general \(X\), the quickest way to compute the pseudoinverse is to use its SVD:
\[X=U_2\Sigma U^{\dagger}_1\]
Formally inverting this factorization (even when \(n\neq m\)) would appear to give:
\[X^{-1}=U_1\Sigma^{-1}U^{\dagger}_2\]
But since \(\Sigma^{-1}\) may not exist, the pseudoinverse of \(X\) is instead given by:
\[X^+=U_1\Sigma^+U_2^{\dagger}\]
(Of course, if \(X^{-1}\) actually exists, then the pseudoinverse is not “pseudo” at all: \(X^+=X^{-1}\).)
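Here is a small sketch (not part of the original text) of this recipe in NumPy: invert only the (numerically) non-zero singular values, transpose, and sandwich between the unitaries; the result agrees with the library routine np.linalg.pinv. The tolerance used to decide which singular values count as zero is an illustrative choice.

```python
import numpy as np

# Sketch: X^+ = U_1 Sigma^+ U_2^dagger, inverting only the non-zero singular values.
rng = np.random.default_rng(3)
n, m = 4, 6
X = rng.standard_normal((n, m))

U2, s, U1h = np.linalg.svd(X, full_matrices=False)
tol = max(n, m) * np.finfo(float).eps * s.max()  # illustrative cutoff for "zero"
s_plus = np.zeros_like(s)
nonzero = s > tol
s_plus[nonzero] = 1.0 / s[nonzero]               # leave zero singular values as zero

X_plus = U1h.conj().T @ np.diag(s_plus) @ U2.conj().T  # an m x n operator C^n -> C^m
assert np.allclose(X_plus, np.linalg.pinv(X))
```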
Problem: In general, given a vector \(\mathbf y\in\mathbf C^n\) and a linear operator \(X:\mathbf C^m\to\mathbf C^n\), the equation \(\mathbf y=X\mathbf x\) only has a unique solution \(\mathbf x\in\mathbf C^m\) provided that \(X\) is invertible, namely \(\mathbf x=X^{-1}\mathbf y\). If this is not the case, however, one can still compute the pseudoinverse \(X^+\); what interpretation does \(X^+\mathbf y\in\mathbf C^m\) have?
Solution: There are \(2\) cases. If \(m\geq n\) and \(X\) is surjective (full rank \(n\)), then there are many solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\); \(X^+\mathbf y\) is an element of this solution space (i.e. \(XX^+\mathbf y=\mathbf y\)), but it has the additional property that it is closest to the origin:
\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m:\mathbf y=X\mathbf x}|\mathbf x|\]
By contrast, if \(m\leq n\) and \(X\) is injective (full rank \(m\)), then generically there is no solution \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\) (unless \(\mathbf y\) happens to lie in the range of \(X\)), and in that case \(X^+\mathbf y\) is of course not a solution either; nevertheless it is the closest one can come to an actual solution, in the sense that:
\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m}|\mathbf y-X\mathbf x|\]
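Both interpretations can be checked numerically; the sketch below (not part of the original text) uses NumPy with arbitrarily chosen dimensions, comparing \(X^+\mathbf y\) against another solution in the underdetermined case and against np.linalg.lstsq in the overdetermined case.

```python
import numpy as np

rng = np.random.default_rng(4)

# Underdetermined case (m > n, X surjective): X^+ y is the minimum-norm solution.
n, m = 3, 6
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
x_plus = np.linalg.pinv(X) @ y
assert np.allclose(X @ x_plus, y)                # it really is a solution

_, _, U1h = np.linalg.svd(X)                     # full m x m right factor
x_null = U1h.conj().T[:, -1]                     # a unit vector in the null space of X
assert np.allclose(X @ x_null, 0)
assert np.linalg.norm(x_plus) < np.linalg.norm(x_plus + x_null)  # other solutions are longer

# Overdetermined case (m < n): generically no solution; X^+ y minimizes |y - X x|.
n, m = 6, 3
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
x_plus = np.linalg.pinv(X) @ y
x_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
assert np.allclose(x_plus, x_lstsq)
```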
Problem: Explain the what and how of conducting principal component analysis (PCA) on a collection of (say \(m\)) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\).
Solution: Earlier, \(X:\mathbf C^m\to\mathbf C^n\) simply had the interpretation of an arbitrary linear operator, in which case the interpretations of \(XX^{\dagger}\) and \(X^{\dagger}X\) were likewise arbitrary. However, one intuitive interpretation is to view \(X\) as a data matrix: if one takes the \(m\) columns of \(X=(\mathbf x_1,…,\mathbf x_m)\) to represent the \(m\) feature vectors in \(\mathbf C^n\), then the \(m\times m\) Gram matrix \(X^{\dagger}X\) takes on a statistical interpretation, recording all pairwise correlations (inner products) between the feature vectors:
\[X^{\dagger}X=
\begin{pmatrix} |\mathbf{x}_1|^2 & \cdots & \mathbf{x}_1^{\dagger}\mathbf{x}_m \\
\vdots & \ddots & \vdots \\
\mathbf{x}_m^{\dagger}\mathbf{x}_1 & \cdots & |\mathbf{x}_m|^2
\end{pmatrix}\]
Alternatively, the other product \(XX^{\dagger}:\mathbf C^n\to\mathbf C^n\) also carries a statistical interpretation: if \(X\mapsto X-\boldsymbol{\mu}\otimes\mathbf{1}_m\) has already been mean-subtracted (where \(\boldsymbol{\mu}=\sum_{i=1}^m\mathbf x_i/m\)) so as to have zero mean, then \(XX^{\dagger}\) is (up to a factor of \(1/m\)) just the covariance matrix of the data. Based on the earlier discussion of SVD, it follows that the columns of \(U_2\in U(n)\) give the principal axes of the covariance matrix \(XX^{\dagger}\), and that the corresponding singular values in \(\Sigma\) measure the spread of the data along those principal axes (the variance along the \(i^{\text{th}}\) axis is \(\sigma_i^2/m\); as variance is commonly denoted \(\sigma^2\), this also explains the choice of notation in the SVD). Note that in such applications one is almost always in the regime \(m\gg n\), so that there are \(n\) singular values \(\sigma_1,…,\sigma_n\).

Each principal axis, together with its corresponding singular value, is said to be a principal component (PC) of the data \(X\), hence the name. The “analysis” part comes from analyzing how the total variance \(\sum_{i=1}^n\sigma_i^2/m\) of the data \(X\) is distributed among the PCs; specifically, the fraction explained by the \(j^{\text{th}}\) PC is \(\sigma_j^2/\sum_{i=1}^n\sigma_i^2\). If the singular values are ordered \(\sigma_1\geq…\geq\sigma_n\), then PCA also provides a lossy data compression algorithm: in analogy with the low-rank approximation application of SVD, one can pick some \(r<n\) and project all \(m\) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\) onto the \(r\)-dimensional subspace spanned by the first \(r\) principal axes, i.e. those associated to the \(r\) largest singular values \(\sigma_1,…,\sigma_r\).
If the data matrix contained the feature vectors as its rows instead, \(X=(\mathbf x_1,…,\mathbf x_m)^{\dagger}\), then repeat the above discussion with \(X^{\dagger}X\) in place of \(XX^{\dagger}\) (and \(U_1\) in place of \(U_2\)).
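As a final illustration, here is a minimal PCA sketch (not part of the original text) in NumPy, using the column convention above; the synthetic data, dimensions, and choice of \(r\) are arbitrary.

```python
import numpy as np

# Sketch: PCA via the SVD of a mean-centered data matrix X = (x_1, ..., x_m),
# with m feature vectors in C^n as columns (real here for simplicity).
rng = np.random.default_rng(6)
n, m, r = 5, 200, 2
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, m))  # data with correlated features

X_centered = X - X.mean(axis=1, keepdims=True)   # subtract the mean feature vector mu
U2, s, _ = np.linalg.svd(X_centered, full_matrices=False)

variances = s**2 / m                             # variance along each principal axis
explained = variances / variances.sum()          # fraction of total variance per PC
print("explained variance ratios:", np.round(explained, 3))

# Lossy compression: project every feature vector onto the first r principal axes.
scores = U2[:, :r].conj().T @ X_centered         # r x m coordinates in the PC basis
X_approx = U2[:, :r] @ scores + X.mean(axis=1, keepdims=True)  # rank-r reconstruction
```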