Suppose we have a linear system \(Ax=b\), where \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^{m}\) are known and \(x \in \mathbb{R}^{n}\) is unknown. Our goal is: given \(A\) and \(b\), find \(x\).
We may encounter the following questions:
1. (Existence) Is there any solution?
2. (Uniqueness) When is there a unique solution? (i.e., under what conditions?)
3. What can one do if there are multiple solutions (in which case there are infinitely many)?
4. What can one do if there is no solution?
To answer Question 1, we introduce the following result:
Theorem: \(Ax=b\) has a solution \(x\) \(\iff\) \(b \in Col(A)\).
Based on this, we have the following result to answer Question 2:
Theorem: Suppose \(b \in Col(A)\). Then \(Ax=b\) has a unique solution \(\iff\) the columns of \(A\) are linearly independent (which requires \(m \ge n\)).
Special case: If \(A\) is a square matrix with linearly independent columns, then there exists a unique solution to \(Ax=b\) for any \(b \in \mathbb{R}^m\), namely \(x=A^{-1}b\).
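As a quick illustration of this special case, here is a minimal R sketch (the matrix \(A\) and vector \(b\) below are made-up toy values):

```r
## Toy square system with linearly independent columns (illustrative numbers)
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(5, 10)
x <- solve(A, b)   # solves Ax = b, i.e., computes A^{-1} b
A %*% x            # recovers b
```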
For Question 3, if there are multiple solutions, one may characterize the solution space completely; or, when a unique solution is desired, a regularization technique can be applied or additional constraints may be imposed. This depends on the application and will not be discussed further here. An alternative is to cast the problem as a least squares minimization problem and then solve for the solution with some regularization, which will be covered under Question 4.
Given \(A \in \mathbb R^{m\times n}\) and assuming there exists a unique solution, we now solve for \(x\). The well-known methods include Gaussian elimination, the Cholesky decomposition, and the QR decomposition.
We are not going to discuss Gaussian elimination here; you can find more details in any linear algebra textbook.
Theorem (Cholesky factorization/decomposition): If a symmetric matrix \(A\) is positive definite (\(A>0\)), then there exists a unique upper triangular matrix \(U\) with strictly positive diagonal entries such that \(A=U^{\top}U\).
Application 1: Multivariate Normal
If \(Z_1,Z_2,\dots,Z_p\overset{\text{i.i.d.}}{\sim} N(0,1)\), then \(Z=(Z_1, \ldots, Z_p)^\top \sim N(0,I)\).
Also, by a well-known property of the multivariate normal distribution, \(\mu+AZ\sim {N}(\mu,AA^\top),\) \[\mu= \begin{bmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_p \end{bmatrix}, \qquad AA^\top= \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p}\\ \sigma_{21} & \sigma_{2}^2 & \cdots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{p}^2 \end{bmatrix},\] where \(AA^\top\) is the variance/covariance matrix.
Goal: to generate (multivariate) \(W\sim {N}(\mu,\Sigma)\). Take the Cholesky decomposition \(\Sigma=U^\top U\); then \(\mu+U^\top Z \sim {N}(\mu, U^\top U)=N(\mu,\Sigma).\)
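A minimal R sketch of this recipe; the mean vector and covariance matrix below are arbitrary illustrative choices:

```r
## Simulate W ~ N(mu, Sigma) via the Cholesky factor Sigma = t(U) %*% U
set.seed(1)
mu    <- c(0, 1, 2)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.5), nrow = 3)   # symmetric positive definite
U <- chol(Sigma)      # upper triangular, t(U) %*% U == Sigma
Z <- rnorm(3)         # Z ~ N(0, I)
W <- mu + t(U) %*% Z  # W ~ N(mu, Sigma)
```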
Application 2: Given \(A>0\) (hence invertible), solve \(Ax=b\). There are three steps in the algorithm (see the R sketch below).
Compute \(A=U^{\top}U\). (Then we have \(U^{\top}Ux=b\), denote \(Ux=y\), now \(U^{\top}y=b\)).
Forward solve \(U^{\top}y=b\) to get \(\hat y\). (Here \(U^{\top}\) is a lower triangular matrix).
Back solve \(Ux=\hat y\) to get \(\hat x\). (Here \(U\) is an upper triangular matrix).
Usage: In R, chol(A) computes the Cholesky factor of a matrix \(A\). In Python, the corresponding NumPy functions are numpy.linalg.cholesky and numpy.linalg.solve. If we have \(Ux=b\) where \(U\) is an upper triangular matrix, we use \(x \leftarrow backsolve(U,b)\); and if we have \(Lx=b\) where \(L\) is a lower triangular matrix, we use \(x \leftarrow forwardsolve(L,b)\).
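A minimal R sketch of the three-step Cholesky solve (the matrix \(A\) and vector \(b\) are illustrative):

```r
## Solve Ax = b for symmetric positive definite A via A = t(U) %*% U
A <- matrix(c(4, 2,
              2, 3), nrow = 2)
b <- c(6, 5)
U <- chol(A)                # Step 1: Cholesky factor, upper triangular
y <- forwardsolve(t(U), b)  # Step 2: forward solve t(U) y = b (lower triangular)
x <- backsolve(U, y)        # Step 3: back solve U x = y (upper triangular)
A %*% x                     # recovers b
```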
Theorem (Thin/Reduced QR decomposition): If \(A \in \mathbb{R}^{m \times n}\) (\(m \ge n\)) has \(n\) linearly independent columns, then \(A=Q_1 R\), where \(Q_1 \in \mathbb{R}^{m \times n}\) has orthonormal columns (hence \(Q_1^{\top}Q_1=I\)), and \(R \in \mathbb{R}^{n \times n}\) is upper triangular with strictly positive diagonal entries (hence invertible).
Remark: The matrix \(Q_1\) provides an orthonormal basis for \(Col(A)\); that is, the columns of \(A\) are linear combinations of the columns of \(Q_1\). In fact, we have \(Col(A)=Col(Q_1)\), and the leading columns satisfy the same property, i.e., \[ \operatorname{span}\{q_1, \cdots, q_k\}=\operatorname{span}\{a_1, \cdots, a_k\}, \quad k=1,2,\cdots,n. \] From the proof of the theorem we can also tell that \(q_k \perp \operatorname{span}\{a_1,a_2,\cdots, a_{k-1}\}\) for \(k\geq 2\).
Theorem (Full QR decomposition): With \(A=Q_1R\) as above, write
\[ A=\begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix} =Q\begin{bmatrix} R \\ 0 \end{bmatrix}, \] where \(Q\in \mathbb R^{m \times m}\) is an orthogonal matrix (i.e. \(Q^{\top} Q=Q Q^{\top}=I\)), \(Q_2 \in \mathbb R^{m\times(m-n)}\) has columns orthogonal to those of \(Q_1\), and \(\begin{bmatrix}R \\0\end{bmatrix}\in \mathbb R^{m \times n}\).
Note that to make \(Q=\begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\) orthogonal, we append additional orthonormal columns to \(Q_1\) to get \(Q\), so the columns of \(Q_2\) are orthonormal and orthogonal to those of \(Q_1\). Since \(A=Q_1 R\), the columns of \(Q_2\) are also orthogonal to the columns of \(A\).
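In R, both factors can be obtained from qr(); a small sketch with a random (illustrative) matrix:

```r
## Thin vs. full QR factors of a random 3 x 2 matrix
set.seed(1)
A <- matrix(rnorm(6), nrow = 3, ncol = 2)
qr_A <- qr(A)
Q1 <- qr.Q(qr_A)                   # thin factor: 3 x 2 with orthonormal columns
Q  <- qr.Q(qr_A, complete = TRUE)  # full factor: 3 x 3 orthogonal matrix
R  <- qr.R(qr_A)                   # 2 x 2 upper triangular
crossprod(Q1)                      # t(Q1) %*% Q1 = I (2 x 2)
crossprod(Q)                       # t(Q) %*% Q  = I (3 x 3)
Q1 %*% R                           # reproduces A
```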
Example: Consider \(A=\begin{bmatrix}a_1 & a_2\end{bmatrix}=\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}\); find the thin and full QR decompositions of \(A\), i.e. \(A=Q_1 R= Q\begin{bmatrix} R \\ 0 \end{bmatrix}\).
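For reference, one such factorization (the columns of \(A\) are already orthogonal, so Gram–Schmidt only needs to normalize them; \(Q_2\), here a single column, is unique only up to sign): \[ Q_1=\begin{bmatrix} 1/\sqrt{2} & 0 \\ 0 & 1 \\ 1/\sqrt{2} & 0 \end{bmatrix},\quad R=\begin{bmatrix} \sqrt{2} & 0 \\ 0 & 1 \end{bmatrix},\quad Q=\begin{bmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1 & 0 \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} \end{bmatrix},\quad \begin{bmatrix} R \\ 0 \end{bmatrix}=\begin{bmatrix} \sqrt{2} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}. \]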
Application: Solving the linear system \(Ax=b\), where \(A\in \mathbb R^{m\times n}\) has linearly independent columns (hence, when a solution exists, it is unique). There are three steps in the algorithm (see the R sketch below).
Do QR decomposition (thin version) to get \(A=Q_1 R\).
Then \(Q_1Rx=b \implies Rx=Q_1^{\top}b\). Denote \(y=Q_1^{\top}b\), which we can compute directly.
Back solve \(Rx=y\) to get \(\hat x\).
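A minimal R sketch of these three steps (the matrix \(A\) is the one from the example above, and \(b\) is a made-up right-hand side chosen to lie in \(Col(A)\)):

```r
## Solve Ax = b via the thin QR decomposition
A <- matrix(c(1, 0,
              0, 1,
              1, 0), nrow = 3, byrow = TRUE)
b <- c(1, 2, 1)           # lies in Col(A)
qr_A <- qr(A)
Q1 <- qr.Q(qr_A)          # thin Q
R  <- qr.R(qr_A)          # upper triangular
y  <- crossprod(Q1, b)    # y = t(Q1) %*% b
x  <- backsolve(R, y)     # back solve R x = y
A %*% x                   # recovers b
```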
Goal: \(Ax=b\), given \(A\) and \(b\), to find \(x\).
When there is no solution: \(b\not \in \operatorname{Col} (A)\).
In this case, a natural problem to consider is the linear least squares problem: \[ \min_x\|Ax-b\|. \] Our strategy is to find an \(\hat x\) such that \(A\hat x=\hat b\) is the projection of \(b\) onto \(\operatorname{Col}(A)\). Then \(\hat x\) is an approximate solution of the original problem \(Ax=b\), but only in the sense that \(A\hat x\) is close to \(b\) in the \(L_2\) norm. Note, however, that this is from the point of view of optimization. From a statistical point of view, the least squares problem is primarily driven by the goal of minimizing some “expected” error, namely the mean squared error between random objects.
Consider \(\hat b =A\hat x\). We obtain \(\hat b\) by dropping a perpendicular from \(b\) onto \(\operatorname{Col}(A)\), that is, \(\left\langle r,\hat a\right\rangle = 0\) for all \(\hat a \in \operatorname{Col}(A)\), where \(r = b-\hat b\) is the residual vector. In particular, \[\begin{align*} \left\langle r,a_i\right\rangle = 0,\; i=1,2,\cdots,n &\iff A^\top r=0\\ &\iff A^\top (b-A\hat x)=0\\ &\iff A^\top A \hat x=A^\top b \end{align*}\]
Now \(\hat x=(A^\top A)^{-1}A^\top b\) if \(A^\top A\) is invertible, and then \(\hat b = A(A^\top A)^{-1}A^\top b\).
Note: \(A^\top A \hat x=A^\top b\) is called the normal equation.
Fact 1: The normal equation always has at least one solution, since \(A^\top b \in \operatorname{Col}(A^\top)=\operatorname{Col}(A^\top A)\).
Fact 2: \(A^\top A\) is invertible if and only if \(A\) has full-column rank. In particular, when \(m<n\), \(A^\top A\) is not invertible.
Fact 3: A solution to the normal equation is not necessarily a solution to the original problem \(Ax=b\). If \(b\in Col(A)\), then \(A^\top A x=A^\top b\) and \(Ax=b\) have the same unique solution if and only if \(A\) has full column rank.
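A small R sketch of the normal-equation solution and the projection \(\hat b\) (the matrix \(A\) and vector \(b\) below are illustrative, with \(b\notin Col(A)\); forming \(A^\top A\) explicitly is fine for small problems, though the QR route below is numerically preferable):

```r
## Normal equations: (A'A) xhat = A'b, and the projection bhat = A xhat
A <- matrix(c(1, 0,
              0, 1,
              1, 0), nrow = 3, byrow = TRUE)
b <- c(1, 2, 3)                               # not in Col(A)
xhat <- solve(crossprod(A), crossprod(A, b))  # solves (A'A) x = A'b
bhat <- A %*% xhat                            # projection of b onto Col(A)
crossprod(A, b - bhat)                        # residual orthogonal to Col(A): ~0
```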
Regularization (ridge regression): Solving \((A^\top A+\lambda I)x=A^\top b\) for some \(\lambda>0\) is known as ridge regression. In this case, \(A^\top A+\lambda I\) is always invertible, and we have \[\hat x = (A^\top A+\lambda I)^{-1}A^\top b.\]
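A short R sketch of the ridge solution (lambda and the data below are arbitrary illustrative choices):

```r
## Ridge-regularized solution: (A'A + lambda I) x = A'b
set.seed(2)
lambda <- 0.1
A <- matrix(rnorm(12), nrow = 4, ncol = 3)
b <- rnorm(4)
xridge <- solve(crossprod(A) + lambda * diag(ncol(A)), crossprod(A, b))
```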
Application of QR to solve a linear least squares problem: \(\min_x \|Ax-b\|\).
Assume the columns of \(A\) are linearly independent. The normal equation is \(A^\top Ax=A^\top b\). Decompose the matrix \(A\) using the QR decomposition \(A=Q_1R\):
\[\begin{align*} A^\top Ax=A^\top b &\Leftrightarrow R^\top Q_1^\top Q_1Rx=R^\top Q_1^\top b\\ &\Leftrightarrow R^\top Rx=R^\top Q_1^\top b\\ &\Leftrightarrow (R^\top )^{-1}R^\top Rx=(R^\top )^{-1}R^\top Q_1^\top b\\ &\Leftrightarrow Rx=Q_1^\top b=y \end{align*}\]
Algorithm:
Do the QR decomposition (thin version) to get \(A=Q_1 R\).
Compute \(y=Q_1^{\top}b\).
Back solve \(Rx=y\) to get \(\hat x\).
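Putting the algorithm together in R (reusing the small illustrative \(A\) from before, now with \(b\notin Col(A)\)); the last line compares the result against base R's built-in least squares fit:

```r
## Least squares via thin QR
A <- matrix(c(1, 0,
              0, 1,
              1, 0), nrow = 3, byrow = TRUE)
b <- c(1, 2, 3)
qr_A <- qr(A)
y    <- crossprod(qr.Q(qr_A), b)        # y = t(Q1) %*% b
xhat <- backsolve(qr.R(qr_A), y)        # back solve R x = y
cbind(xhat, lm.fit(A, b)$coefficients)  # agrees with base R's least squares
```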