Ridge Regression Bias-Variance Tradeoff


In ridge regression, the coefficient estimator is

$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^TY$

where $X$ is the $n \times p$ design matrix, $Y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$, and $\lambda > 0$ is the regularization parameter.
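As a concrete reference point, the estimator can be computed directly from its closed form. The sketch below is a minimal illustration on synthetic data; the function name `ridge_estimator`, the seed, the dimensions, and the coefficient values are all assumptions chosen for the example, not part of the problem.

```python
import numpy as np

def ridge_estimator(X, Y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T Y."""
    p = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Synthetic data; every value below is an illustrative assumption.
rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(size=n)

beta_ols = ridge_estimator(X, Y, 0.0)    # lam = 0 reduces to OLS
beta_ridge = ridge_estimator(X, Y, 5.0)  # lam > 0 shrinks the estimate

print("OLS  :", beta_ols)
print("ridge:", beta_ridge)
```

With $\lambda = 0$ the function returns the OLS solution (assuming $X^TX$ is invertible), and for $\lambda > 0$ the Euclidean norm of the estimate is smaller than that of OLS.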

Show that $\hat{\beta}_{\text{ridge}}$ is biased and derive an explicit expression for the bias as a function of $\lambda$ and $\beta$.

Derive the variance of $\hat{\beta}_{\text{ridge}}$ and show it is smaller than the OLS variance.

Explain qualitatively how the MSE of the ridge estimator behaves as $\lambda$ increases from $0$ to $\infty$. Why does an optimal $\lambda > 0$ often exist?

Hints

Start by substituting $Y = X\beta + \varepsilon$ into the ridge formula and taking expectations. The bias comes from the fact that a certain matrix is not the identity.
Factor the bias as $(W - I)\beta$ where $W = (X^TX + \lambda I)^{-1}X^TX$, and simplify $W - I$ by writing $X^TX = (X^TX + \lambda I) - \lambda I$.
For the variance comparison, diagonalize via the eigendecomposition of $X^TX$. The $j$-th eigenvalue component has ridge variance $\sigma^2 d_j/(d_j + \lambda)^2$ versus OLS variance $\sigma^2/d_j$.

Worked Solution

How to Think About It: Ridge regression is the workhorse of regularized estimation. The core idea: OLS is unbiased but can blow up when features are correlated or $p$ is close to $n$ -- the $(X^TX)^{-1}$ matrix magnifies noise along directions with small eigenvalues. Ridge adds $\lambda I$ to stabilize the inverse. You are deliberately biasing the estimator toward zero in exchange for dramatically lower variance. The question is whether that trade is worth it, and the answer is almost always yes when you have collinear or high-dimensional features.

Before diving into algebra, think about what happens at the extremes: $\lambda = 0$ gives you OLS (unbiased, potentially huge variance), and $\lambda \to \infty$ gives you $\hat{\beta} \to 0$ (maximum bias, zero variance). Somewhere in between, the MSE is minimized.

Quick Estimate: Suppose $X^TX = \text{diag}(d_1, \ldots, d_p)$, so that the $d_j$ are its eigenvalues. The $j$-th component of the ridge estimator has bias $-\lambda \beta_j / (d_j + \lambda)$ and variance $\sigma^2 d_j / (d_j + \lambda)^2$. For a small eigenvalue $d_j = 0.1$ with $\beta_j = 1$, $\sigma^2 = 1$, and $\lambda = 1$: the OLS variance for that component is $1/0.1 = 10$, while the ridge variance is $0.1/1.21 \approx 0.08$. The ridge bias is $-1/1.1 \approx -0.91$, giving squared bias $\approx 0.83$. The ridge MSE is $0.08 + 0.83 = 0.91$, versus OLS MSE of $0 + 10 = 10$. Ridge wins by an order of magnitude on this component. That is the whole story -- for directions where the signal-to-noise is poor, the variance savings dwarf the bias cost.
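The arithmetic in the quick estimate can be reproduced in a few lines. This is only a sketch of the per-component formulas quoted above; the helper name `component_terms` is hypothetical.

```python
def component_terms(d, beta_j, sigma2, lam):
    """Bias, variance and MSE of one ridge component (diagonal X^T X)."""
    bias = -lam * beta_j / (d + lam)
    var = sigma2 * d / (d + lam) ** 2
    return bias, var, bias ** 2 + var

# Numbers from the quick estimate: d_j = 0.1, beta_j = 1, sigma^2 = 1.
ridge_bias, ridge_var, ridge_mse = component_terms(0.1, 1.0, 1.0, lam=1.0)
ols_bias, ols_var, ols_mse = component_terms(0.1, 1.0, 1.0, lam=0.0)

print(f"ridge: bias={ridge_bias:.2f} var={ridge_var:.2f} mse={ridge_mse:.2f}")
print(f"OLS  : bias={ols_bias:.2f} var={ols_var:.2f} mse={ols_mse:.2f}")
```

Setting `lam=0.0` recovers the OLS numbers, so the same helper gives both sides of the comparison.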

Approach: Substitute $Y = X\beta + \varepsilon$ into the ridge formula, take expectations for the bias, and compute the covariance matrix for the variance.

Formal Solution:

*Part 1: Bias*

Substitute $Y = X\beta + \varepsilon$:

$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^T(X\beta + \varepsilon) = (X^TX + \lambda I)^{-1}X^TX\beta + (X^TX + \lambda I)^{-1}X^T\varepsilon$

Taking expectations (since $E[\varepsilon] = 0$):

$E[\hat{\beta}_{\text{ridge}}] = (X^TX + \lambda I)^{-1}X^TX \beta = W\beta$

where $W = (X^TX + \lambda I)^{-1}X^TX$. For $\lambda > 0$, $W \neq I$, so the estimator is biased. The bias is:

$\text{Bias}(\hat{\beta}_{\text{ridge}}) = E[\hat{\beta}_{\text{ridge}}] - \beta = (W - I)\beta$

To simplify, note that $W - I = (X^TX + \lambda I)^{-1}(X^TX - X^TX - \lambda I) = -\lambda(X^TX + \lambda I)^{-1}$. Therefore:

$\text{Bias}(\hat{\beta}_{\text{ridge}}) = -\lambda(X^TX + \lambda I)^{-1}\beta$

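This makes the shrinkage explicit. In the eigenbasis of $X^TX$, the component of $\beta$ along the $j$-th eigenvector is scaled by $d_j/(d_j + \lambda) < 1$, so its bias is $-\lambda/(d_j + \lambda)$ times that component: everything is pulled toward zero, and the pull is proportionally stronger for directions corresponding to small eigenvalues of $X^TX$.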
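The bias formula can be sanity-checked numerically. The sketch below verifies the matrix identity $W - I = -\lambda(X^TX + \lambda I)^{-1}$ and compares the theoretical bias with a Monte Carlo average over simulated noise. The design matrix, seed, $\lambda$, $\sigma$, and number of replications are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 50, 3, 4.0, 1.0       # illustrative values
X = rng.normal(size=(n, p))              # fixed design
beta = np.array([2.0, -1.0, 0.5])        # assumed true coefficients

A = X.T @ X
ridge_inv = np.linalg.inv(A + lam * np.eye(p))
W = ridge_inv @ A

# Identity from the derivation: W - I = -lam (X^T X + lam I)^{-1}
identity_gap = np.abs((W - np.eye(p)) - (-lam * ridge_inv)).max()

# Monte Carlo: redraw the noise many times with X held fixed.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))
Y_all = X @ beta + eps                   # each row is one simulated Y
# Row-wise ridge estimates; ridge_inv is symmetric, so no transpose needed.
estimates = Y_all @ X @ ridge_inv

empirical_bias = estimates.mean(axis=0) - beta
theoretical_bias = -lam * ridge_inv @ beta

print("identity gap    :", identity_gap)
print("empirical bias  :", empirical_bias)
print("theoretical bias:", theoretical_bias)
```

The two bias vectors should agree up to Monte Carlo error, which shrinks as `reps` grows.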

*Part 2: Variance*

The random part of $\hat{\beta}_{\text{ridge}}$ is $(X^TX + \lambda I)^{-1}X^T\varepsilon$. Its covariance:

$\text{Var}(\hat{\beta}_{\text{ridge}}) = \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$

Using the eigendecomposition $X^TX = Q \Lambda Q^T$ with eigenvalues $d_1, \ldots, d_p$, the covariance becomes diagonal in the rotated coordinates $Q^T\hat{\beta}_{\text{ridge}}$, with $j$-th diagonal entry:

$\text{Var}(\hat{\beta}_{\text{ridge}})_{jj} = \frac{\sigma^2 d_j}{(d_j + \lambda)^2}$

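Compare to OLS, which in the same coordinates has $\text{Var}(\hat{\beta}_{\text{OLS}})_{jj} = \sigma^2 / d_j$ (assuming $X^TX$ is invertible, so every $d_j > 0$). Since $d_j^2 < (d_j + \lambda)^2$ whenever $\lambda > 0$, we have:

$\frac{d_j}{(d_j + \lambda)^2} < \frac{1}{d_j} \quad \text{for all } \lambda > 0$

so the ridge variance is strictly smaller in every direction.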
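A short numerical check of the variance comparison, assuming a random full-rank design (the seed, sizes, $\lambda$ and $\sigma^2$ are illustrative): rotate the ridge covariance into the eigenbasis of $X^TX$, compare its diagonal with $\sigma^2 d_j/(d_j+\lambda)^2$, and confirm that the OLS covariance minus the ridge covariance is positive definite.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma2 = 30, 4, 2.0, 1.5      # illustrative values
X = rng.normal(size=(n, p))
A = X.T @ X

# Eigendecomposition of the symmetric matrix X^T X = Q diag(d) Q^T.
d, Q = np.linalg.eigh(A)

ridge_inv = np.linalg.inv(A + lam * np.eye(p))
cov_ridge = sigma2 * ridge_inv @ A @ ridge_inv
cov_ols = sigma2 * np.linalg.inv(A)

# In the eigenbasis both covariances are diagonal.
rotated_ridge = Q.T @ cov_ridge @ Q
rotated_ols = Q.T @ cov_ols @ Q
ridge_formula = sigma2 * d / (d + lam) ** 2
ols_formula = sigma2 / d

# Eigenvalues of the difference; all positive means OLS variance is larger
# along every direction, not only along the coordinate axes.
gap_eigs = np.linalg.eigvalsh(cov_ols - cov_ridge)

print("ridge variances:", np.diag(rotated_ridge))
print("OLS variances  :", np.diag(rotated_ols))
print("min gap eigenvalue:", gap_eigs.min())
```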

*Part 3: The Tradeoff*

The MSE of the $j$-th component is:

$\text{MSE}_j = \underbrace{\frac{\lambda^2 \beta_j^2}{(d_j + \lambda)^2}}_{\text{Bias}^2} + \underbrace{\frac{\sigma^2 d_j}{(d_j + \lambda)^2}}_{\text{Variance}}$

As $\lambda$ increases from $0$:

- Bias$^2$ increases monotonically (from $0$ to $\beta_j^2$)
- Variance decreases monotonically (from $\sigma^2/d_j$ to $0$)
- Total MSE first decreases (variance drops faster than bias grows) and then increases (bias dominates)

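Each component's MSE curve is therefore U-shaped. Differentiating gives $\frac{d}{d\lambda}\text{MSE}_j = \frac{2 d_j(\lambda\beta_j^2 - \sigma^2)}{(d_j + \lambda)^3}$, which is negative at $\lambda = 0$ whenever $\sigma^2 > 0$ and changes sign at $\lambda = \sigma^2/\beta_j^2$ when $\beta_j \neq 0$. Summing over components, the total MSE also has a negative slope at $\lambda = 0$, so some $\lambda^{*} > 0$ always improves on OLS. Since $\beta$ and $\sigma^2$ are unknown in practice, $\lambda^{*}$ is usually chosen by cross-validation.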
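To see the U-shape concretely, the per-component MSE can be evaluated on a grid of $\lambda$ values. Setting the derivative of $\text{MSE}_j$ with respect to $\lambda$ to zero gives $\lambda = \sigma^2/\beta_j^2$, and the sketch below checks that the grid minimizer lands there. The values of $d_j$, $\beta_j$, $\sigma^2$ and the grid itself are assumptions picked for illustration.

```python
import numpy as np

def component_mse(lam, d, beta_j, sigma2):
    """Squared bias plus variance of one ridge component."""
    bias_sq = (lam * beta_j) ** 2 / (d + lam) ** 2
    var = sigma2 * d / (d + lam) ** 2
    return bias_sq + var

d, beta_j, sigma2 = 0.5, 2.0, 1.0        # illustrative values
lams = np.linspace(0.0, 5.0, 5001)       # grid with step 0.001
mse = component_mse(lams, d, beta_j, sigma2)

lam_grid = lams[np.argmin(mse)]          # minimizer found on the grid
lam_closed = sigma2 / beta_j ** 2        # stationary point of MSE_j

print("MSE at lam = 0 (OLS):", mse[0])
print("grid minimizer      :", lam_grid)
print("sigma^2 / beta_j^2  :", lam_closed)
print("minimum MSE         :", mse.min())
```

In a real problem $\beta_j$ and $\sigma^2$ are unknown, so this closed form is a diagnostic for intuition rather than a tuning rule.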

Answer: The ridge estimator has bias $-\lambda(X^TX + \lambda I)^{-1}\beta$, which shrinks coefficients toward zero. Its variance $\sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$ is strictly less than the OLS variance $\sigma^2(X^TX)^{-1}$. The total MSE is minimized at some $\lambda^{*} > 0$ because the variance reduction initially outweighs the bias increase. The effect is largest for directions with small eigenvalues of $X^TX$ -- precisely where OLS is most unstable.

Intuition

Ridge regression is probably the single most important regularization technique to internalize for quant work. The core lesson is that unbiasedness is overrated. OLS is unbiased, but when features are correlated or the design matrix is nearly rank-deficient, the variance can be enormous -- and in practice, variance is what kills your out-of-sample predictions. Ridge trades a controlled amount of bias (shrinking toward zero) for a potentially massive reduction in variance. The key insight from the eigenvalue decomposition is that ridge helps most where you need it most: directions with small eigenvalues are exactly the ones where OLS variance explodes, and those are the directions that ridge shrinks most aggressively.

This tradeoff appears everywhere in quantitative finance. When you fit a multi-factor return model and have correlated factors (say, value and quality), OLS can give you wildly unstable loadings that flip sign from one month to the next. Ridge (or equivalently, adding a Gaussian prior on coefficients in the Bayesian view) stabilizes those estimates. The Bayesian interpretation is clean: ridge is MAP estimation under a $N(0, \sigma^2/\lambda)$ prior on each coefficient. Choosing $\lambda$ by cross-validation is choosing how skeptical you are that any single factor matters. In practice, nearly every production portfolio optimization or risk model uses some form of shrinkage -- ridge, Ledoit-Wolf covariance shrinkage, or James-Stein estimation are all manifestations of the same principle.
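The Bayesian reading can be checked numerically as well. Under the standard conjugate Gaussian model with prior variance $\tau^2 = \sigma^2/\lambda$ on each coefficient, the posterior for $\beta$ is Gaussian, so its mode (the MAP estimate) equals its mean. The sketch below computes that posterior mean and compares it with the ridge formula; the data, seed, and parameter values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, sigma2 = 60, 3, 3.0, 2.0      # illustrative values
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 0.0, -1.5])   # assumed coefficients
Y = X @ true_beta + rng.normal(scale=np.sqrt(sigma2), size=n)

tau2 = sigma2 / lam                      # prior variance implied by lam

# Posterior precision and mean for beta | Y with prior N(0, tau2 * I).
post_precision = X.T @ X / sigma2 + np.eye(p) / tau2
post_mean = np.linalg.solve(post_precision, X.T @ Y / sigma2)

# Ridge estimate from the closed form with the same lam.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print("posterior mean:", post_mean)
print("ridge estimate:", beta_ridge)
```

A larger $\lambda$ corresponds to a tighter prior around zero, which is the sense in which $\lambda$ encodes skepticism about any single coefficient.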
