OLS Estimation and Multicollinearity
Consider a linear regression model where you regress a response $y$ on a vector of predictors $x$:
$y = X\beta + \epsilon$
where $X$ is an $n \times p$ design matrix, $\beta$ is the coefficient vector, and $\epsilon$ is a noise vector with the usual OLS assumptions (zero mean, constant variance, uncorrelated errors).
- Derive the ordinary least squares (OLS) estimator $\hat{\beta}$.
- For the simple regression case (one predictor, no intercept), express $\hat{\beta}$ in terms of the correlation $\rho_{xy}$ and the standard deviations $\sigma_x$, $\sigma_y$.
- Now suppose $X$ has multiple columns that are highly correlated with each other. What happens to the OLS estimates? Why do the coefficient estimates become unreliable, and how can you diagnose and fix the problem?
Hints
- Start from the objective: minimize $\|y - X\beta\|^2$ and take the derivative with respect to $\beta$ to get the normal equations.
- For the simple regression case, the key identity is $\hat{\beta} = \text{Cov}(x,y)/\text{Var}(x)$. Think about what happens when you standardize both variables.
- For multicollinearity, examine what happens to $(X^TX)^{-1}$ when $X^TX$ has a near-zero eigenvalue. The Variance Inflation Factor $\text{VIF}_j = 1/(1 - R_j^2)$ quantifies the damage to each coefficient's precision.
Worked Solution
How to Think About It: OLS is just projection -- you are projecting $y$ onto the column space of $X$ to find the linear combination that gets closest in squared distance. The formula drops out of a first-order condition. The interesting part is what happens when the columns of $X$ are nearly linearly dependent: the projection is still fine (predicted values are stable), but the decomposition of that projection into individual coefficient contributions becomes wildly unstable. This is the core of multicollinearity -- the prediction works, but the attribution breaks.
Quick Estimate: To build intuition, think about two predictors with correlation $\rho = 0.99$. The variance of each coefficient is inflated by a factor of $\text{VIF} = 1/(1 - \rho^2) = 1/(1 - 0.9801) \approx 50$. So a coefficient that would have a standard error of 0.1 with uncorrelated predictors now has a standard error of $0.1 \times \sqrt{50} \approx 0.7$. The estimate is essentially useless for interpretation, even though $R^2$ might be high.
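The arithmetic in the quick estimate can be checked in a few lines. This is only a sketch: the baseline standard error of 0.1 is the illustrative figure from the text, and the variable names are made up for this example.

```python
import math

rho = 0.99                    # correlation between the two predictors (from the text)
vif = 1.0 / (1.0 - rho**2)    # variance inflation factor for the two-predictor case

se_uncorrelated = 0.1         # illustrative baseline standard error (from the text)
se_collinear = se_uncorrelated * math.sqrt(vif)  # standard errors scale with sqrt(VIF)

print(f"VIF = {vif:.2f}")
print(f"inflated standard error = {se_collinear:.3f}")
```

Note that the variance is inflated by the VIF, while the standard error is inflated by its square root.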
Formal Solution:
*Part 1: Deriving the OLS estimator*
Minimize the sum of squared residuals:
$\min_{\beta} \|y - X\beta\|^2 = (y - X\beta)^T(y - X\beta)$
Expand and take the derivative with respect to $\beta$:
$\frac{\partial}{\partial \beta} \left[ y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \right] = -2X^Ty + 2X^TX\beta = 0$
This gives the normal equations:
$X^TX\hat{\beta} = X^Ty$
Provided $X^TX$ is invertible:
$\hat{\beta} = (X^TX)^{-1}X^Ty$
The covariance matrix of $\hat{\beta}$ is $\text{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$, so the precision of the estimates depends entirely on the structure of $X^TX$.
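As a rough numerical check of Part 1, the sketch below solves the normal equations on simulated data and compares the result with NumPy's least-squares routine. The data-generating values (`beta_true`, `sigma`, the sample size) are assumptions chosen for illustration, and $\sigma^2$ is replaced by the usual residual-based estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; all of these values are illustrative assumptions.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ beta_true + sigma * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated covariance of beta_hat: s^2 (X^T X)^{-1}, with s^2 = RSS / (n - p).
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
cov_beta = s2 * np.linalg.inv(XtX)
std_err = np.sqrt(np.diag(cov_beta))

print("beta_hat:", beta_hat)
print("std_err: ", std_err)
```

Solving the linear system is generally preferable to computing $(X^TX)^{-1}$ explicitly; the inverse is formed here only because the covariance matrix needs it.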
*Part 2: Simple regression case*
With one predictor and no intercept, $X = x$ is a column vector. Then:
$\hat{\beta} = \frac{x^Ty}{x^Tx} = \frac{\sum x_i y_i}{\sum x_i^2}$
If $x$ and $y$ are centered (zero mean), this simplifies to:
$\hat{\beta} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} = \rho_{xy} \cdot \frac{\sigma_y}{\sigma_x}$
This is a clean result: the regression coefficient equals the correlation scaled by the ratio of standard deviations. When $x$ and $y$ are standardized (both have unit variance), $\hat{\beta} = \rho_{xy}$ directly.
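The identity in Part 2 is easy to verify numerically. The sketch below uses simulated data (the slope 0.7 and the noise scales are arbitrary assumptions) and checks that the centered no-intercept slope matches $\rho_{xy} \cdot \sigma_y / \sigma_x$, and that standardizing both variables makes the slope equal to the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; the slope and noise levels are illustrative assumptions.
x = rng.normal(loc=2.0, scale=3.0, size=500)
y = 0.7 * x + rng.normal(scale=1.5, size=500)

# Center both variables so the no-intercept formula applies.
xc = x - x.mean()
yc = y - y.mean()

# Direct formula: sum(x_i * y_i) / sum(x_i^2) on the centered data.
beta_direct = (xc @ yc) / (xc @ xc)

# Same slope via correlation times the ratio of standard deviations.
rho = np.corrcoef(x, y)[0, 1]
beta_from_rho = rho * y.std() / x.std()

# After standardizing both variables, the slope is the correlation itself.
xs = xc / x.std()
ys = yc / y.std()
beta_standardized = (xs @ ys) / (xs @ xs)

print(beta_direct, beta_from_rho)
print(beta_standardized, rho)
```

The identities hold exactly in-sample (up to floating-point error) as long as the same normalization is used for both standard deviations.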
*Part 3: Multicollinearity*
When columns of $X$ are highly correlated, $X^TX$ becomes ill-conditioned (its smallest eigenvalue approaches zero). Here is why this matters:
- Unstable coefficients: $(X^TX)^{-1}$ has entries that scale like /\lambda_{\min}$, where $\lambda_{\min}$ is the smallest eigenvalue. A near-zero eigenvalue means the inverse blows up, so tiny perturbations in $y$ cause large swings in $\hat{\beta}$.
- Inflated variance: The variance of the $j$-th coefficient is $\sigma^2 [(X^TX)^{-1}]_{jj}$. The Variance Inflation Factor quantifies this:
$\text{VIF}_j = \frac{1}{1 - R_j^2}$
where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other predictors. A common rule of thumb treats a VIF above 5 to 10 as a sign of trouble.
- Symptoms: The overall model fits well ($R^2$ is high), but individual coefficients have large standard errors and signs that can flip from one sample or specification to the next. You cannot tell which predictor is doing the work.
- Diagnosis: Check VIFs, examine the eigenvalue spectrum of $X^TX$, or compute condition numbers.
- Remedies:
  1. Ridge regression: Add $\lambda I$ to $X^TX$, giving $\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$. This biases coefficients toward zero but dramatically reduces variance.
  2. PCA regression: Project $X$ onto its top principal components, discarding directions with near-zero variance.
  3. Drop redundant features: If two predictors carry nearly the same information, keep one.
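The diagnostics and the ridge remedy listed above can be sketched with plain NumPy. The helper names `vif` and `ridge`, the simulated design, and the penalty $\lambda = 10$ are all assumptions made for this example, not a recommended configuration.

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y; lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)   # x1 and x2 share the common factor z
x2 = z + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)             # unrelated predictor
X = np.column_stack([x1, x2, x3])
X = X - X.mean(axis=0)              # center so no intercept is needed
y = x1 + x2 + 0.5 * x3 + rng.normal(size=n)
y = y - y.mean()

vifs = vif(X)
eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
cond = eigvals[-1] / eigvals[0]         # condition number of X^T X

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 10.0)          # lam = 10 is an arbitrary illustrative choice

print("VIFs:", vifs)
print("condition number of X^T X:", cond)
print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

In this setup the two collinear columns should show large VIFs while the unrelated column stays near 1. In practice $\lambda$ would typically be chosen by cross-validation rather than fixed by hand.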
Answer: The OLS estimator is $\hat{\beta} = (X^TX)^{-1}X^Ty$. In simple regression, $\hat{\beta} = \rho_{xy} \cdot \sigma_y / \sigma_x$. When predictors are highly correlated, $X^TX$ becomes ill-conditioned, inflating coefficient variance by a factor of $\text{VIF}_j = 1/(1 - R_j^2)$. The model still predicts well but individual coefficients are unreliable. The standard fix is ridge regression, which trades a small amount of bias for a large reduction in variance.
Intuition
The deep lesson here is the difference between prediction stability and coefficient stability. OLS finds the best linear combination of your predictors, and that combination is robust -- even with correlated predictors, $\hat{y} = X\hat{\beta}$ barely changes if you perturb the data. But the decomposition of that prediction into individual contributions ($\beta_1 x_1 + \beta_2 x_2 + \ldots$) is fragile when the predictors are collinear, because there are many nearly-equivalent ways to split the credit. This is exactly the situation you face in factor modeling: you might have momentum, short-term reversal, and some flow signal that are all 90% correlated. Your portfolio forecast is fine, but if you try to attribute P&L to individual factors, the attribution is noise.
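The gap between prediction stability and coefficient stability shows up clearly in a small simulation. The sketch below holds a collinear design fixed, redraws the noise many times, and compares the spread of the individual coefficients with the spread of their sum and of the fitted values. The design, the true coefficients, and the number of draws are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 200, 500

# Fixed design: two predictors with correlation of roughly 0.99.
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     z + 0.1 * rng.normal(size=n)])
beta_true = np.array([1.0, 1.0])

betas = np.empty((n_sims, 2))
fits = np.empty((n_sims, n))
for s in range(n_sims):
    y = X @ beta_true + rng.normal(size=n)     # fresh noise on each draw
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    betas[s] = b
    fits[s] = X @ b

sd_individual = betas.std(axis=0)     # spread of each coefficient across draws
sd_sum = betas.sum(axis=1).std()      # spread of beta_1 + beta_2 across draws
sd_fit = fits.std(axis=0).mean()      # average spread of the fitted values

print("sd of each coefficient:", sd_individual)
print("sd of their sum:       ", sd_sum)
print("mean sd of fitted y:   ", sd_fit)
```

The individual coefficients swing widely from draw to draw, while their sum and the fitted values move much less: the data pin down the combined effect, not the split between the two predictors.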
This is also why ridge regression is so natural in quant finance. You are rarely interested in unbiased coefficient estimates for their own sake -- you want stable predictions and stable portfolio weights. Shrinking coefficients toward zero is just Bayesian regularization with a Gaussian prior, and in high-dimensional or correlated settings, the bias-variance tradeoff almost always favors some shrinkage.
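One way to make the Bayesian reading concrete, sketched under the assumption of Gaussian noise $\epsilon \sim N(0, \sigma^2 I)$ and an independent Gaussian prior $\beta \sim N(0, \tau^2 I)$ (the prior variance $\tau^2$ is a symbol introduced here for illustration): up to constants, the negative log-posterior is
$\frac{1}{2\sigma^2}\|y - X\beta\|^2 + \frac{1}{2\tau^2}\|\beta\|^2$
Setting the gradient with respect to $\beta$ to zero gives
$\hat{\beta}_{\text{MAP}} = \left(X^TX + \frac{\sigma^2}{\tau^2} I\right)^{-1}X^Ty$
which is the ridge estimator with $\lambda = \sigma^2/\tau^2$. A tighter prior (smaller $\tau^2$) corresponds to a larger $\lambda$ and more shrinkage.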