Statistical Simulation, Wei Li. Updated: 2024-09-09.
Principal component analysis (PCA) is concerned with explaining the variance-covariance structure of \({X}=\left({X}_{1}, \cdots, {X}_{p}\right)^\top\) through a few linear combinations of these variables.
Define the random vector and its mean vector \[ X=\left(X_{1}, \cdots, X_{p}\right)^{\top}, \quad \boldsymbol{\mu}=E(X)=\left(\mu_{1}, \cdots, \mu_{p}\right)^{\top}. \] The variance-covariance matrix of \(X\) is \[ \Sigma=\operatorname{Cov}(X)=E(X-\boldsymbol{\mu})(X-\boldsymbol{\mu})^{\top}, \] whose \(i j\)-th entry is \(\sigma_{i j}=E\left(X_{i}-\mu_{i}\right)\left(X_{j}-\mu_{j}\right)\) for any \(1 \leq i, j \leq p\). Here \(\boldsymbol{\mu}\) is the population mean and \(\Sigma\) is the population variance-covariance matrix; in practice, \(\boldsymbol{\mu}\) and \(\Sigma\) are unknown and estimated from the data.
Consider the linear combinations \[\begin{align*} Z_{1} &=\mathbf{v}_{1}^{\top} {X}=v_{11} X_{1}+v_{21} X_{2}+\cdots+v_{p1} X_{p} \\ Z_{2} &=\mathbf{v}_{2}^{\top} {X}=v_{12} X_{1}+v_{22} X_{2}+\cdots+v_{p2} X_{p} \\ \vdots &=\qquad \qquad \vdots \\ Z_{p} &=\mathbf{v}_{p}^{\top} {X}=v_{1p} X_{1}+v_{2p} X_{2}+\cdots+v_{p p} X_{p} \end{align*}\] In a matrix form,
\[ Z= V^\top X, \] where \(Z=(Z_1, \ldots, Z_p)^\top\), \(V=(\mathbf{v}_1, \ldots, \mathbf{v}_p)\). Then \[ \begin{array}{l} \operatorname{Var}\left(Z_{j}\right)=\mathbf{v}_{j}^{\top} \Sigma \mathbf{v}_{j}, \quad j=1, \cdots, p, \\ \operatorname{Cov}\left(Z_{j}, Z_{k}\right)=\mathbf{v}_{j}^{\top} \Sigma \mathbf{v}_{k}, \quad \forall j \neq k. \end{array} \] Principal component analysis is a procedure that finds directions (\(\mathbf{v}_{j}, j=1,\ldots,p\)) such that \(Z_{1}, \cdots, Z_{p}\) capture maximum variability and are mutually uncorrelated, and then uses these \(Z_j\)’s for subsequent statistical analysis.
Principal components (PCs):
The procedure can be described as follows:
The first \(\mathrm{PC}=\) linear combination \(Z_{1}=\mathbf{v}_{1}^{\top} X\) that maximizes \(\operatorname{Var}\left(\mathbf{v}_{1}^{\top} X\right)\) subject to \(\left\|\mathbf{v}_{1}\right\|=1.\)
The second \(\mathrm{PC}=\) linear combination \(Z_{2}=\mathbf{v}_{2}^{\top} X\) that maximizes \(\operatorname{Var}\left(\mathbf{v}_{2}^{\top} X\right)\) subject to \(\left\|\mathbf{v}_{2}\right\|=1\) and \(\operatorname{Cov}\left(\mathbf{v}_{1}^{\top} X, \mathbf{v}_{2}^{\top} X\right)=0\).
The \(j\)-th PC satisfies \[\begin{align*} \max & \operatorname{Var}\left(\mathbf{v}_{j}^{\top} X \right) \\ \text { subject to } &\left\|\mathbf{v}_{j}\right\|=1 \\ & \operatorname{Cov}\left(\mathbf{v}_{l}^{\top} X, \mathbf{v}_{j}^{\top} X \right)=\mathbf{v}_{l}^{\top} \Sigma \mathbf{v}_{j}=0 \\ & \text { for } l=1, \ldots, j-1 \end{align*}\] where \(j=2, \cdots, p\).
It turns out that such a matrix \(V\) can be found via the eigen-decomposition of \(\Sigma\): assume \(\Sigma\) has \(p\) eigenvalue-eigenvector pairs \((\lambda_j, \mathbf{v}_j)\) satisfying \[ \Sigma \mathbf{v}_{j}=\lambda_{j} \mathbf{v}_{j}, \quad j=1, \cdots, p, \] where \(\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{p}\geq 0\) and \(\left\|\mathbf{v}_{j}\right\|=1\) for all \(j\). This gives the following spectral decomposition
\[\begin{align*} \Sigma=\sum_{j=1}^{p} \lambda_{j} \mathbf{v}_{j} \mathbf{v}_{j}^{\top} = V \left[\begin{array}{cccc} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{array}\right]V^\top. \end{align*}\]
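As a quick numerical check, here is a minimal numpy sketch of the spectral decomposition; the \(3\times 3\) matrix `Sigma` is a made-up example, not from the notes.

```python
import numpy as np

# A made-up 3x3 covariance matrix (symmetric positive definite) for illustration.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# np.linalg.eigh returns eigenvalues in ascending order for symmetric matrices;
# flip them so that lambda_1 >= lambda_2 >= ... >= lambda_p.
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]

# Spectral decomposition: Sigma = sum_j lambda_j v_j v_j^T = V diag(lam) V^T.
Sigma_rebuilt = V @ np.diag(lam) @ V.T
print(np.allclose(Sigma, Sigma_rebuilt))   # True
print(np.allclose(V.T @ V, np.eye(3)))     # columns of V are orthonormal
```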
We may view \(X\) as a function of \(Z\). Since \(V\) is an orthogonal matrix, we also have \(X=VZ\); therefore, \[ X=\sum_{k=1}^p \mathbf{v}_{k}Z_k, \] where each \(Z_k\) can be viewed as the coordinate of \(X\) in the coordinate system given by the columns of \(V\).
Note also that \(\operatorname{Var}(X)=\sum_{j=1}^p \lambda_j \mathbf{v}_j \mathbf{v}^\top_j\), so \(\operatorname{Cov}(X_s, X_t)=\sum_{j=1}^p \lambda_j v_{sj} v_{tj}\). Similarity of the PC loadings on the first few components between different \(X_j\)’s can therefore be heuristically viewed as a degree of linear association. Moreover, \(\operatorname{Cov}(X, Z)=V\operatorname{Var}(Z)=[\lambda_1 \mathbf{v}_1, \ldots, \lambda_p \mathbf{v}_p]\), so the variables \(X_j\) that are most correlated with the first few principal components are the most important in explaining the variability in the data set.
Note \(\operatorname{Var}(X_i)=\sum_{j=1}^p\lambda_jv_{ij}^2\). The square of the loading value reflects how much of the variance of the variable is explained by the principal component and can be heuristically understood as the importance of the variable. The contribution of the variable \(X_i\) to the \(j\)-th component can be measured by \(v_{ij}^2/\sum_{k=1}^pv_{kj}^2=v_{ij}^2\). The contribution of \(X_i\) in accounting for the variation in the first \(l\) components can be measured by \((\sum_{j=1}^l \lambda_{j} v_{ij}^2)/(\sum_{k=1}^p\sum_{j=1}^l \lambda_j v_{kj}^2)= (\sum_{j=1}^l \lambda_{j} v_{ij}^2)/ (\sum_{j=1}^l \lambda_j).\)
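These contribution measures can be computed directly from the eigenvalues and eigenvectors; a small self-contained sketch, reusing the made-up `Sigma` from the snippet above and the illustrative choice \(l=2\):

```python
import numpy as np

# Same made-up covariance matrix as above.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]          # descending eigenvalues

# Contribution of variable X_i to the j-th component: v_ij^2 (columns have unit norm).
contrib_to_pc = V**2                    # p x p matrix, each column sums to 1

# Contribution of X_i over the first l components:
# (sum_{j<=l} lambda_j v_ij^2) / (sum_{j<=l} lambda_j).
l = 2
contrib_first_l = (V[:, :l]**2 @ lam[:l]) / lam[:l].sum()
print(contrib_to_pc.sum(axis=0))        # [1. 1. 1.]
print(contrib_first_l, contrib_first_l.sum())  # entries sum to 1
```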
Suppose we have \({X}_{i}=\left(X_{i 1}, \cdots, X_{i p}\right)^{\top}\) for \(i=1, \cdots, n\). If \(\Sigma\) is unknown, we use the sample variance-covariance matrix \(S_{n}\) as its estimator. Write the data matrix as \[\begin{align*} \mathbf{X}=\left[\begin{array}{c} \boldsymbol{x}_{1}^{\top} \\ \boldsymbol{x}_{2}^{\top} \\ \vdots \\ \boldsymbol{x}_{n}^{\top} \end{array}\right] = \left[\begin{array}{c} \mathbf{x}_{1},\mathbf{x}_{2}, \cdots, \mathbf{x}_{p} \end{array}\right], \end{align*}\] where the rows \(\boldsymbol{x}_{i}^{\top}\) are observations and the columns \(\mathbf{x}_{j}\) are variables. Then \[ S_{n}=\frac{1}{n-1} \mathbf{X}_{c}^{\top} \mathbf{X}_{c}=\frac{1}{n-1} \sum_{i=1}^{n}\left({X}_{i}-\overline{{X}}\right)\left({X}_{i}-\overline{{X}}\right)^{\top}, \] where \(\mathbf{X}_{c}\) is the column-centered version of \(\mathbf{X}\).
There are two ways to find \(V=(\mathbf{v}_1, \ldots, \mathbf{v}_p)\): (i) the eigen-decomposition of the sample covariance matrix \(S_n\), or (ii) the singular value decomposition (SVD) of the centered data matrix.
Specifically, in the following discussion, suppose we have an \(n \times p\) demeaned data matrix \(\mathbf{X}\) with \(n \geq p\) (that is, we take \(\mathbf{X}\) to be in mean-deviation form, \(\mathbf{X}_c\)). By the full SVD (or the thin SVD), \[ \mathbf{X}=U D V^{\top}, \] where \(U\) has orthonormal columns, \(D=\operatorname{diag}(d_1, \ldots, d_p)\) with \(d_1 \geq \cdots \geq d_p \geq 0\), and \(V\) is the \(p \times p\) orthogonal matrix whose columns are the PC directions. Note \(S_{n}=\frac{1}{n-1} \mathbf{X}^{\top} \mathbf{X}= V \frac{D^2}{n-1} V^\top\), so \(\lambda_j = d_j^2/(n-1)\).
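A minimal numpy sketch of the two routes on simulated data (the data matrix and seed below are made up for illustration); the relation \(\lambda_j = d_j^2/(n-1)\) ties them together.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # simulated correlated data
Xc = X - X.mean(axis=0)                                 # mean-deviation form

# Route 1: eigen-decomposition of the sample covariance matrix S_n.
S = Xc.T @ Xc / (n - 1)
lam, V_eig = np.linalg.eigh(S)
lam, V_eig = lam[::-1], V_eig[:, ::-1]

# Route 2: (thin) SVD of the centered data matrix, Xc = U D V^T.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
V_svd = Vt.T

print(np.allclose(lam, d**2 / (n - 1)))          # lambda_j = d_j^2 / (n-1)
print(np.allclose(np.abs(V_eig), np.abs(V_svd))) # same directions up to sign flips
```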
Now, we project each row \(\boldsymbol{x}_i\) of \(\mathbf{X}\) onto the first PC direction \(\mathbf{v}_{1}\) to obtain \(z_{i1}=\mathbf{v}_{1}^\top\boldsymbol{x}_i\). Collecting \(\{z_{i1}\}_{i=1}^n\) into a column vector \(\mathbf{z}_{1}\) gives the first principal component of \(\mathbf{X}\). We do the same for the second, third, ..., and the \(p\)-th PC direction: \[ \textbf{the } j\textbf{-th principal component: } \qquad \mathbf{z}_j= \mathbf{X} \mathbf{v}_j \qquad (n \times 1) , \qquad j=1, \cdots, p, \] where the \(i\)-th entry \(z_{ij}=\mathbf{v}_{j}^\top\boldsymbol{x}_i\) is called the \(i\)-th score of the \(j\)-th PC of \(\mathbf{X}\). The scores indicate how each observation projects onto the principal components. The contribution of the \(i\)-th case to the \(j\)-th principal component can be assessed by \(z_{ij}^2/\sum_{k=1}^nz_{kj}^2\).
In a data matrix form, we have \(\mathbf{Z}=\mathbf{X}V,\) where \[\begin{align*} \mathbf{Z}= \left[\begin{array}{c} \mathbf{z}_{1},\mathbf{z}_{2} \cdots \mathbf{z}_{p} \end{array}\right] = \left[\begin{array}{c} \boldsymbol{z}_{1}^{\top} \\ \boldsymbol{z}_{2}^{\top} \\ \vdots \\ \boldsymbol{z}_{n}^{\top} \end{array}\right]. \end{align*}\]
Note the \(p \times 1\) vector \(\boldsymbol{z}_{i}=V^\top \boldsymbol{x}_i\) collects the scores of the \(i\)-th observation on all \(p\) principal components of \(\mathbf{X}\), \(i=1, \cdots, n\).
Geometric meaning: since \(V\) is an orthogonal matrix, \(\boldsymbol{x}_i=V\boldsymbol{z}_i\), so \(\boldsymbol{z}_{i}\) can be viewed as the new coordinates of \(\boldsymbol{x}_i\) in the \(p\)-dimensional coordinate system given by the columns of \(V\) instead of the standard basis \([e_1,\ldots,e_p]\).
It is easy to show that the \(j\)-th principal component \(\mathbf{z}_{j}\) has maximum sample variance \(d_{j}^{2} / (n-1)\), subject to being orthogonal to the earlier ones.
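Continuing the simulated example above (the names `Xc`, `d`, `V` are carried over from that sketch), the scores and these properties can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

Z = Xc @ V                       # n x p matrix of scores, z_ij = v_j^T x_i

# Sample variance of the j-th PC equals d_j^2 / (n-1); the PCs are uncorrelated.
print(np.allclose(Z.var(axis=0, ddof=1), d**2 / (n - 1)))
print(np.allclose(np.cov(Z, rowvar=False), np.diag(d**2 / (n - 1))))

# Contribution of the i-th case to the j-th component: z_ij^2 / sum_k z_kj^2.
case_contrib = Z**2 / (Z**2).sum(axis=0)
print(case_contrib[:, 0].sum())  # contributions to PC1 sum to 1
```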
Remarks:
Concerns:
Note the total variance of \(\mathbf{X}\) is \[ \frac{1}{n-1}\operatorname{tr}(\mathbf{X}^\top\mathbf{X})=\frac{1}{n-1}\operatorname{tr}(\mathbf{Z}^\top\mathbf{Z})=\frac{1}{n-1}\sum_{j=1}^rd_j^2, \] where \(r=\operatorname{rank}(\mathbf{X})\). So the fraction of total variance explained by the \(j\)-th PC \(\mathbf{z}_j\) is \(d_j^2/\sum_{k=1}^r d_k^2\).
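A short sketch of the variance-explained fractions, again on the same simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

d = np.linalg.svd(Xc, compute_uv=False)   # singular values d_1 >= ... >= d_p

frac = d**2 / (d**2).sum()                # fraction of total variance per PC
print(frac)                               # proportions, summing to 1
print(np.cumsum(frac))                    # cumulative variance explained
```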
One application of principal components is principal components regression:
Ordinary Least Squares (OLS): \(\underset{\beta}{\text{min}}\;\|\mathbf{X}\beta-\mathbf{y}\|^2\).
Principal Components Regression (PCR): \(\underset{\theta}{\text{min}}\;\|\mathbf{Z}\theta-\mathbf{y}\|^2\).
The PCA technique is particularly useful when high multicollinearity exists among the features/variables. By construction, the first principal component contains the most information about the data, and the subsequent principal components contain less and less information while being uncorrelated with one another. The first few principal components \(\mathbf{z}_1,\mathbf{z}_2,\dots,\mathbf{z}_k\) \((k<p)\) can then be used as predictors in lieu of the original set of predictors in \(X\).
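A minimal PCR sketch on simulated data (the data, the coefficient vector, and the choice \(k=2\) are all illustrative): regress \(\mathbf{y}\) on the first \(k\) principal components and map the fitted coefficients back to the original predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated predictors
beta_true = np.array([1.0, 0.5, -1.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Principal components of the (centered) predictors.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Zk = Xc @ Vt.T[:, :k]                                    # first k PCs

# Least squares of y on the first k components ...
theta, *_ = np.linalg.lstsq(Zk, yc, rcond=None)
# ... mapped back to coefficients on the original (centered) predictors.
beta_pcr = Vt.T[:, :k] @ theta
print(beta_pcr)
```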
One particular drawback of PCR is that the formation of the principal components of \(X\) does not take into account any relationship between \(X\) and \(Y\). One remedy is Partial Least Squares (PLS). Instead of focusing only on explaining the variance in \(\mathbf{X}\), PLS takes into account the response \(\mathbf{y}\). The PLS components are linear combinations of the original predictors, but they are chosen in a way that also explains variance in \(\mathbf{y}\).
The first PLS component is \(Z_1=\mathbf{v}_1^\top X\), where \(\mathbf{v}_1=(\operatorname{corr}(X_1, Y), \cdots, \operatorname{corr}(X_p, Y))^\top\) (up to normalization); the \(j\)-th PLS component (\(j\geq 2\)) \(Z_j=\mathbf{v}_j^\top X\) is obtained by finding the \(j\)-th PLS direction \(\mathbf{v}_j\) from \[\begin{align*} \max & \operatorname{Var}\left(\mathbf{v}_j^\top X\right) \operatorname{Corr}^2(Y, \mathbf{v}_j^\top X) \\ \text { subject to } & \left\|\mathbf{v}_j\right\|=1 \\ & \operatorname{Cov}\left(\mathbf{v}_l^\top X, \mathbf{v}_j^\top X\right)=\mathbf{v}_l^\top \Sigma \mathbf{v}_j=0 \\ & \text { for } l=1, \ldots, j-1 \end{align*}\]
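A sketch of the first PLS direction as described above, on simulated data; normalizing \(\mathbf{v}_1\) to unit length is a convention assumed here, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

# First PLS direction: the vector of marginal correlations corr(X_j, y),
# rescaled to unit length (assumed convention).
v1 = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
v1 = v1 / np.linalg.norm(v1)

Xc = X - X.mean(axis=0)
z1 = Xc @ v1                      # first PLS component (scores)
print(np.corrcoef(z1, y)[0, 1])   # typically correlates strongly with y
```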
Assume \(l \ll p\). We would like to approximate each \(\boldsymbol{x}_i \in \mathbb{R}^p\) with a lower-dimensional representation \(\boldsymbol{z}_i \in \mathbb{R}^l\) (a latent factor, or embedding), so that \(\boldsymbol{x}_i \approx \sum_{k=1}^l \mathbf{v}_{k}{z}_{ik}\), where the \(\mathbf{v}_{k} \in \mathbb{R}^p\) are orthogonal to each other. We minimize the following objective over \(V\), subject to the restriction that \(V\) has orthonormal columns: \[\begin{align*} \mathcal{L}({V})=\frac{1}{n}\left\|\mathbf{X}-\mathbf{Z} {V}^{\top}\right\|_F^2=\frac{1}{n} \sum_{i=1}^n\left\|\boldsymbol{x}_i-{V} \boldsymbol{z}_i\right\|^2. \end{align*}\] Here \(\| \cdot \|_F\) is the Frobenius norm, and each row of the matrix of latent factors \(\mathbf{Z}\) contains the low-dimensional version of the corresponding row of \(\mathbf{X}\).
If we take \(\boldsymbol{z}_i= \tilde{V}^{\top} \boldsymbol{x}_i\), we solve \[\begin{align*} \min_{V, \tilde{V}}\ \mathcal{L}({V}, \tilde{V})=\frac{1}{n}\left\|\mathbf{X}-\mathbf{X}\tilde{V} {V}^{\top}\right\|_F^2. \end{align*}\]
The optimal solution under the constraint is given by \(V^*=\tilde{V}^*=V_{1:l}\), where \(V_{1:l}\) is the \(p \times l\) matrix containing the \(l\) eigenvectors of the sample covariance matrix \(S_n\) with the largest eigenvalues. An interesting result is that this optimization problem is equivalent to the constrained variance-maximization problem described earlier. The solution can be obtained via the SVD low-rank approximation of \(\mathbf{X}\).
One can view the mapping \(\boldsymbol{x}_i \mapsto \boldsymbol{z}_i=\tilde{V}^{\top}\boldsymbol{x}_i\) as a linear encoder, and \(\boldsymbol{z}_i \mapsto \hat{\boldsymbol{x}}_i={V}\boldsymbol{z}_i\) as a linear decoder. At the optimum the objective equals \(\frac{1}{n}\left\|\mathbf{X}-\mathbf{X}V {V}^{\top}\right\|_F^2\) with \(V^\top V=I\), so the loss is a kind of reconstruction error.
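A minimal sketch of the linear encoder/decoder view, keeping the first \(l\) directions (the data, names, and \(l=2\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, l = 100, 6, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Vl = Vt.T[:, :l]              # p x l, the l leading PC directions

Z = Xc @ Vl                   # encoder: x_i -> z_i = V_l^T x_i
Xhat = Z @ Vl.T               # decoder: z_i -> x_hat_i = V_l z_i

# Average reconstruction error (1/n)||X - X V_l V_l^T||_F^2 ...
err = np.sum((Xc - Xhat)**2) / n
# ... equals the discarded singular-value energy sum_{j>l} d_j^2, divided by n.
print(np.isclose(err, np.sum(d[l:]**2) / n))
```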
Note also that the reconstruction is purely deterministic (for fixed or estimated \(V\)): given \(\boldsymbol{z}_i\), the reconstruction \(\hat{\boldsymbol{x}}_i\) is fully determined. Probabilistic PCA is a probabilistic version of PCA that allows generation of \(\boldsymbol{x}_i\): given \(\boldsymbol{z}_i\), \(\boldsymbol{x}_i\) is generated according to a specified distribution (for example, a Gaussian centered at the decoded value).
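As a contrast with the deterministic decoder, here is a toy generative sketch in the spirit of probabilistic PCA, using the standard Gaussian formulation \(\boldsymbol{z} \sim N(0, I_l)\), \(\boldsymbol{x} \mid \boldsymbol{z} \sim N(W\boldsymbol{z}+\boldsymbol{\mu}, \sigma^2 I_p)\); the particular `W`, `mu`, and `sigma` below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
p, l, sigma = 5, 2, 0.3
W = rng.normal(size=(p, l))        # made-up loading matrix
mu = np.zeros(p)

# Probabilistic PCA as a generative model:
# z ~ N(0, I_l), then x | z ~ N(W z + mu, sigma^2 I_p).
z = rng.normal(size=l)
x = W @ z + mu + sigma * rng.normal(size=p)
print(x)                           # one generated observation
```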
Provided and discussed in lab.