Deriving the First Principal Component

Linear Algebra · Medium · Free problem

You have a centered data matrix $X$ of size $n \times p$ (rows are observations, columns are features), and the sample covariance matrix is $\Sigma = \frac{1}{n} X^T X$.

You want to find the direction in $\mathbb{R}^p$ along which the data has the most spread. Formally: find the unit vector $\mathbf{w}$ (with $\mathbf{w}^T \mathbf{w} = 1$) that maximizes the variance of the projected data:

$\text{Var}(X\mathbf{w}) = \mathbf{w}^T \Sigma \mathbf{w}$

  1. Set up and solve the constrained optimization problem to find $\mathbf{w}$.
  2. Show that the maximum variance equals a specific eigenvalue of $\Sigma$, and identify which one.
  3. How would you extend this to find the second principal component? What fraction of total variance does the $k$-th component explain?

Hints

  1. You are maximizing a quadratic form $\mathbf{w}^T \Sigma \mathbf{w}$ on the unit sphere -- think constrained optimization.
  2. Write down the Lagrangian and differentiate. The first-order condition should remind you of a familiar matrix equation.
  3. Once you see the eigenvalue equation $\Sigma \mathbf{w} = \lambda \mathbf{w}$, evaluate the objective at the solution to determine which eigenvector wins.

Worked Solution

How to Think About It: PCA is just asking: "if I had to summarize all $p$ features with a single number per observation, what linear combination captures the most information?" Information here means variance -- the direction where the data is most spread out. This is a constrained optimization problem you can crack with Lagrange multipliers, and the punchline is beautiful: the answer falls out of the eigendecomposition of the covariance matrix. Every quant should be able to derive this from scratch in under two minutes.

Quick Estimate: Before any math, think geometrically. If $\Sigma$ is diagonal with entries $(5, 2, 1)$, the data is most spread along the first coordinate axis, so $\mathbf{w}_1 = (1, 0, 0)$ and the variance captured is 5 (the largest diagonal entry). For a general $\Sigma$, the eigenvectors rotate the axes to align with the directions of maximum spread, and the eigenvalues tell you how much spread each direction captures. So the answer must involve eigenvectors and eigenvalues.
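
The diagonal case above is easy to check numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`Sigma`, `w1`, `lambda1`) are illustrative choices, not part of the original problem.

```python
import numpy as np

# Covariance matrix from the quick estimate: diagonal with entries (5, 2, 1).
Sigma = np.diag([5.0, 2.0, 1.0])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# with the matching eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# The last column pairs with the largest eigenvalue.
w1 = eigvecs[:, -1]
lambda1 = eigvals[-1]

# Eigenvectors are only defined up to sign, so look at absolute values.
print(np.abs(w1))
print(lambda1)
```

The sign ambiguity is worth remembering: $\mathbf{w}_1$ and $-\mathbf{w}_1$ describe the same direction and capture the same variance.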

Approach: Lagrange multipliers on the unit-norm constraint, then read off the eigenvalue equation.

Formal Solution:

Part 1 -- Finding the optimal direction:

We want to maximize $\mathbf{w}^T \Sigma \mathbf{w}$ subject to $\mathbf{w}^T \mathbf{w} = 1$. Write the Lagrangian:

$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T \Sigma \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)$

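Take the gradient with respect to $\mathbf{w}$ and set it to zero. Because $\Sigma$ is symmetric, the gradient of $\mathbf{w}^T \Sigma \mathbf{w}$ is $2\Sigma \mathbf{w}$: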

$\nabla_{\mathbf{w}} \mathcal{L} = 2\Sigma \mathbf{w} - 2\lambda \mathbf{w} = 0$

This gives:

$\Sigma \mathbf{w} = \lambda \mathbf{w}$

This is the eigenvalue equation for $\Sigma$. So the optimal $\mathbf{w}$ must be an eigenvector of $\Sigma$ with eigenvalue $\lambda$.

Part 2 -- Which eigenvector?

At any eigenvector solution, the objective value is:

$\mathbf{w}^T \Sigma \mathbf{w} = \mathbf{w}^T (\lambda \mathbf{w}) = \lambda \|\mathbf{w}\|^2 = \lambda$

So the variance captured equals the eigenvalue. The unit sphere is compact, so a maximum exists, and it must be one of these stationary points. To maximize, pick the eigenvector corresponding to the largest eigenvalue $\lambda_1$. The first principal component direction is $\mathbf{w}_1$, the eigenvector of $\Sigma$ with the largest eigenvalue, and the variance along that direction is $\lambda_1$.
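
As a sanity check on Parts 1 and 2, the sketch below builds a hypothetical centered data set (random, for illustration only), confirms that the projected variance along the top eigenvector equals $\lambda_1$, and checks that no randomly drawn unit vector does better. It assumes NumPy and uses the same $\frac{1}{n}$ convention as the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 observations of 4 correlated features.
n, p = 500, 4
A = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ A
X = X - X.mean(axis=0)              # center the columns

Sigma = X.T @ X / n                 # sample covariance with the 1/n convention

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
lambda1, w1 = eigvals[-1], eigvecs[:, -1]

# X is centered, so Xw has mean zero and its variance is the mean square.
proj_var = np.mean((X @ w1) ** 2)
quad_form = w1 @ Sigma @ w1

# Draw many random unit vectors and evaluate w^T Sigma w for each row of W.
W = rng.normal(size=(10_000, p))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", W, Sigma, W)

print(proj_var, quad_form, lambda1)           # all three should agree
print(random_vars.max() <= lambda1 + 1e-12)   # no random direction beats lambda1
```

Random sampling cannot prove optimality, but it is a quick way to catch a sign or ordering mistake (for example, accidentally taking the first column of `eigvecs` instead of the last).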

Part 3 -- Higher components and variance explained:

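For the second principal component, maximize $\mathbf{w}^T \Sigma \mathbf{w}$ subject to $\|\mathbf{w}\| = 1$ and orthogonality to $\mathbf{w}_1$: $\mathbf{w}^T \mathbf{w}_1 = 0$. Add a second Lagrange multiplier $\mu$ for the orthogonality constraint; the first-order condition becomes $2\Sigma \mathbf{w} - 2\lambda \mathbf{w} - \mu \mathbf{w}_1 = 0$. Left-multiplying by $\mathbf{w}_1^T$ and using $\mathbf{w}_1^T \Sigma \mathbf{w} = \lambda_1 \mathbf{w}_1^T \mathbf{w} = 0$ gives $\mu = 0$, so we are back to $\Sigma \mathbf{w} = \lambda \mathbf{w}$, now restricted to eigenvectors orthogonal to $\mathbf{w}_1$. The best of these is $\mathbf{w}_2$, the eigenvector with the second-largest eigenvalue $\lambda_2$. In general, the $k$-th principal component is the eigenvector for the $k$-th largest eigenvalue.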
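
The orthogonality-constrained problem can also be checked numerically. This is a rough sketch on a hypothetical random covariance matrix, assuming NumPy: it projects random vectors onto the subspace orthogonal to $\mathbf{w}_1$ and confirms that none of them captures more variance than $\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical symmetric positive semidefinite covariance matrix.
p = 4
B = rng.normal(size=(p, p))
Sigma = B @ B.T / p

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
w1, w2 = eigvecs[:, -1], eigvecs[:, -2]
lambda2 = eigvals[-2]

# Random unit vectors forced to be orthogonal to w1.
W = rng.normal(size=(10_000, p))
W -= np.outer(W @ w1, w1)                   # remove each row's component along w1
W /= np.linalg.norm(W, axis=1, keepdims=True)
constrained_vars = np.einsum("ij,jk,ik->i", W, Sigma, W)

print(abs(w1 @ w2))                              # close to zero: w2 is orthogonal to w1
print(w2 @ Sigma @ w2, lambda2)                  # variance along w2 equals lambda2
print(constrained_vars.max() <= lambda2 + 1e-12) # nothing orthogonal to w1 beats lambda2
```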

Since the total variance across all features is $\text{tr}(\Sigma) = \sum_{i=1}^{p} \lambda_i$, the fraction of variance explained by the $k$-th component is:

$\text{proportion}_k = \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i} = \frac{\lambda_k}{\text{tr}(\Sigma)}$
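
The variance-explained formula takes only a few lines to compute. The sketch below uses a small made-up covariance matrix (the numbers are illustrative only) and assumes NumPy; `ratios` and `cumulative` are names chosen here for the per-component and running fractions.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix, chosen only for illustration.
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [0.0, 0.0, 1.0]])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order;
# reverse so that index 0 is the first principal component.
eigvals = np.linalg.eigvalsh(Sigma)[::-1]

ratios = eigvals / np.trace(Sigma)   # fraction of variance per component
cumulative = np.cumsum(ratios)       # fraction explained by the first k components

print(ratios)
print(cumulative)
```

The cumulative version is what is usually inspected in practice when deciding how many components to keep.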

Answer: The first principal component direction is the eigenvector of $\Sigma$ corresponding to its largest eigenvalue $\lambda_1$. The projected data $X\mathbf{w}_1$ has variance $\lambda_1$, which is the maximum achievable over all unit vectors. The $k$-th component explains a fraction $\lambda_k / \text{tr}(\Sigma)$ of the total variance.

Intuition

PCA is really just rotating your coordinate system so the axes line up with the directions of greatest variability. The covariance matrix $\Sigma$ encodes all the second-order structure of your data -- how features co-move. Its eigenvectors are the natural axes of the data cloud, and the eigenvalues tell you how stretched the cloud is along each axis. Maximizing projected variance with a unit-norm constraint is one of the simplest constrained optimization problems you can write down, and the fact that it reduces to an eigenvalue problem is one of the cleanest results in linear algebra.

In practice, this shows up everywhere in quant finance: reducing a universe of 500 correlated stock returns to a handful of principal portfolios, compressing a yield curve into level/slope/curvature factors, or building orthogonal risk factors for a multi-factor model. The key subtlety people miss is that PCA is purely a variance-maximizing decomposition -- it does not care about prediction or economic meaning. The first PC of stock returns is usually the market factor, but that is an empirical fact, not a guarantee. And because PCA is based on the sample covariance, it is sensitive to outliers and non-stationarity, which is why practitioners often use robust covariance estimators or rolling windows.
