Dimensionality Reduction for Wide Data Matrices
You are given a high-dimensional data matrix $X \in \mathbb{R}^{N \times P}$ where $P \gg N$ -- many more features than observations.
From a linear algebra perspective:
- What does the rank of $X$ tell you about the effective dimensionality of the data? Why does $P \gg N$ immediately constrain the rank?
- How would you use the SVD of $X$ to find the top $k$ directions of maximum variance? Why is it more efficient to work with the $N \times N$ Gram matrix $XX^T$ rather than the $P \times P$ covariance matrix $X^TX$ when $P \gg N$?
- How does the Gram-Schmidt process relate to finding an orthonormal basis for the image (column space) of $X^T$?
Hints
- When $P \gg N$, the rank is at most $N$. This means the data lies in a low-dimensional subspace of $\mathbb{R}^P$ regardless of how many features there are.
- Compare the cost of eigendecomposing $X^TX$ ($P \times P$, cost $O(P^3)$) versus $XX^T$ ($N \times N$, cost $O(N^3)$). The latter is the kernel trick for PCA.
- Gram-Schmidt on the columns of $X^T$ gives an orthonormal basis for the image of $X^T$. The SVD goes further by ordering basis vectors by singular value (variance explained).
Worked Solution
How to Think About It: When $P \gg N$, the data lives in a space with far more dimensions than observations. But here is the key insight: $N$ data points can span at most an $N$-dimensional subspace of $\mathbb{R}^P$. So no matter how many features you have, all the "action" happens in a subspace of dimension at most $N$. The question is how to find that subspace efficiently without ever working in the full $P$-dimensional space.
Key Insight: The rank constraint $\text{rank}(X) \le \min(N, P) = N$ means you can always reduce to at most $N$ dimensions without losing any information. The computational trick is to work with the small $N \times N$ matrix rather than the huge $P \times P$ one.
The Method:
*Part 1 -- Rank and effective dimensionality:*
The rank of $X$ satisfies $\text{rank}(X) \le \min(N, P) = N$. This means the $N$ rows of $X$ (data points in $\mathbb{R}^P$) span a subspace of dimension at most $N$. In practice, if some features are correlated, the effective rank may be even lower. The rank tells you the true number of independent directions of variation in the data.
*Part 2 -- SVD and the kernel trick:*
The SVD gives $X = U \Sigma V^T$ where $\Sigma$ has at most $N$ nonzero singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. The right singular vectors $v_1, \dots, v_k$ (columns of $V$) are the principal component directions -- they capture the most variance.
To find these, you could eigendecompose the $P \times P$ covariance matrix $X^TX = V \Sigma^2 V^T$, but this is $O(P^3)$ -- prohibitive when $P$ is huge. Instead, compute the $N \times N$ Gram matrix $G = XX^T = U \Sigma^2 U^T$. Eigendecompose $G$ (only $O(N^3)$) to get $U$ and $\Sigma$. Then recover the principal directions via $V = X^T U \Sigma^{-1}$. The projected data (PC scores) are $U\Sigma$ directly.
This is the "kernel trick" in PCA: you never need to form or eigendecompose the $P \times P$ matrix.
*Part 3 -- Gram-Schmidt:*
The columns of $X^T$ are the $N$ data points viewed as vectors in $\mathbb{R}^P$. The Gram-Schmidt process orthogonalizes these to produce an orthonormal basis $\{q_1, \dots, q_r\}$ for the image (column space) of $X^T$, where $r = \text{rank}(X)$. This gives an orthonormal basis for the subspace containing all the data variation, but unlike SVD, it does not order the basis by variance explained.
Practical Considerations: - PCA via the Gram matrix is the go-to when $P \gg N$. Common in genomics ($P \sim 20{,}000$ genes, $N \sim 200$ samples) and NLP (bag-of-words with huge vocabularies). - For nonlinear structure, t-SNE/UMAP project to 2-3D for visualization but do not preserve global distances. - Autoencoders learn nonlinear low-dimensional representations. - In finance, PCA on a large covariance matrix of returns typically reveals 3-5 dominant factors (market, size, value, etc.).
Answer: The rank constraint $\text{rank}(X) \le N$ guarantees the data lives in at most an $N$-dimensional subspace. Use the SVD (or equivalently, eigendecompose the $N \times N$ Gram matrix $XX^T$) to find the directions of maximum variance efficiently. Gram-Schmidt provides an orthonormal basis for the data subspace but without the variance ordering that SVD gives.
Intuition
The fundamental insight here is that data dimensionality is not about how many features you measure -- it is about how many independent directions of variation exist. With $N$ observations, you can have at most $N$ independent directions. The rest of the $P$-dimensional space is just noise or redundancy. PCA via SVD finds those directions ordered by importance, and the Gram matrix trick makes this computationally feasible even when $P$ is enormous.
This shows up constantly in quantitative finance. When you have a universe of 5,000 stocks but only 252 trading days of data, your return matrix has rank at most 252. Running PCA on the Gram matrix (252 x 252) is vastly cheaper than on the covariance matrix (5,000 x 5,000), and gives you the same principal components. The first few PCs typically correspond to market, sector, and style factors.