OLS and Ridge Estimators in a Linear Factor Model

Regression · Medium · Free problem

You are estimating a linear factor model for $N$ assets with $k$ factors over $T$ time periods. The model is:

$r_t = \alpha + B f_t + \varepsilon_t$

where $r_t \in \mathbb{R}^N$ is the vector of asset returns at time $t$, $f_t \in \mathbb{R}^k$ is the vector of factor returns, $B \in \mathbb{R}^{N \times k}$ is the loading matrix, $\alpha \in \mathbb{R}^N$ is the intercept vector, and $\varepsilon_t$ is the residual.

(a) Write the OLS estimator $\hat{B}$ in matrix form. Be explicit about how the data matrices are constructed.

(b) Define the in-sample $R^2$ per asset. Explain why a high in-sample $R^2$ does not guarantee good out-of-sample performance.

(c) Propose a ridge regression alternative. Write the ridge estimator $\hat{B}_{\text{ridge}}$ and explain how the regularization parameter $\lambda$ affects the bias-variance tradeoff.

Hints

Start by stacking the time-series data into matrices $Y$ and $X$, then recall the standard multivariate OLS formula $(X^\top X)^{-1} X^\top Y$.
For the $R^2$ discussion, think about what happens when $k$ grows relative to $T$ -- how many degrees of freedom does OLS consume, and what does that mean for the residuals?
The ridge estimator modifies the Gram matrix by adding $\lambda I_k$ to the diagonal. Think about how this changes the eigenvalues of $F^\top F$ and what that does to the variance of the estimator in the near-singular directions.

Worked Solution

How to Think About It: This is a bread-and-butter factor modeling question. In practice, you are fitting something like a Fama-French or PCA-based factor model to a cross-section of assets. The OLS part is mechanical -- stack the data, invert the Gram matrix. The interesting part is why OLS falls apart out of sample: with many factors and finite data, OLS overfits the noise in the factor loadings. Ridge regression is the standard fix -- you shrink the loadings toward zero, trading a small bias for a big reduction in estimation variance. A quant researcher does this daily.

Part (a): OLS Estimator

Stack the data across $T$ observations. Let $Y \in \mathbb{R}^{T \times N}$ be the matrix of returns, where row $t$ is $r_t^\top$. Let $X \in \mathbb{R}^{T \times (k+1)}$ be the design matrix with a column of ones (for the intercept) followed by the factor returns, so row $t$ is $[1, f_t^\top]$. Define $\Theta = [\alpha, B]^\top \in \mathbb{R}^{(k+1) \times N}$ as the stacked coefficient matrix.

The OLS estimator minimizes the sum of squared residuals:

$\hat{\Theta}_{\text{OLS}} = (X^\top X)^{-1} X^\top Y$

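The last $k$ rows of $\hat{\Theta}_{\text{OLS}}$ are $\hat{B}^\top$ (a $k \times N$ matrix); the first row is $\hat{\alpha}^\top$. If you demean both $r_t$ and $f_t$ first (subtract their time-series means, giving $\tilde{r}_t = r_t - \bar{r}$ and $\tilde{f}_t = f_t - \bar{f}$), you can drop the intercept column and write:

$\hat{B}^\top = \left(\sum_{t=1}^T \tilde{f}_t \tilde{f}_t^\top \right)^{-1} \left(\sum_{t=1}^T \tilde{f}_t \tilde{r}_t^\top \right) = (F^\top F)^{-1} F^\top R$

where $F \in \mathbb{R}^{T \times k}$ has the demeaned factor returns $\tilde{f}_t^\top$ as rows and $R \in \mathbb{R}^{T \times N}$ has the demeaned asset returns $\tilde{r}_t^\top$ as rows. The transpose is needed because $B$ is $N \times k$ while the right-hand side is $k \times N$. The intercept is then recovered as $\hat{\alpha} = \bar{r} - \hat{B} \bar{f}$.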
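The two routes above (design matrix with a column of ones, versus demeaning first) can be checked numerically. The following is a minimal NumPy sketch on simulated data; the sizes `T`, `N`, `k`, the noise scale, and all variable names are assumptions made up for illustration, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 500, 4, 3            # hypothetical sizes, chosen only for illustration

# Simulate r_t = alpha + B f_t + eps_t with made-up parameters.
alpha_true = rng.normal(scale=0.01, size=N)
B_true = rng.normal(size=(N, k))
factors = rng.normal(size=(T, k))                        # row t is f_t^T
returns = alpha_true + factors @ B_true.T + 0.1 * rng.normal(size=(T, N))

# Route 1: design matrix with a leading column of ones.
X = np.column_stack([np.ones(T), factors])               # T x (k+1)
theta_hat, *_ = np.linalg.lstsq(X, returns, rcond=None)  # (k+1) x N
alpha_hat = theta_hat[0]                                 # intercepts, length N
B_hat = theta_hat[1:].T                                  # loadings, N x k

# Route 2: demean both sides and drop the intercept column.
F = factors - factors.mean(axis=0)
R = returns - returns.mean(axis=0)
B_hat_demeaned = np.linalg.solve(F.T @ F, F.T @ R).T     # N x k

print(np.allclose(B_hat, B_hat_demeaned))                # do the two routes agree?
print(np.abs(B_hat - B_true).max())                      # largest loading error
```

The sketch uses `np.linalg.lstsq` and `np.linalg.solve` rather than forming an explicit inverse, which is usually the numerically safer choice when the Gram matrix is poorly conditioned.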

Part (b): In-Sample $R^2$ and Overfitting

For asset $i$, the in-sample $R^2$ is:

$R_i^2 = 1 - \frac{\sum_{t=1}^T (r_{i,t} - \hat{\alpha}_i - \hat{B}_i f_t)^2}{\sum_{t=1}^T (r_{i,t} - \bar{r}_i)^2} = 1 - \frac{\text{RSS}_i}{\text{TSS}_i}$

where $\hat{B}_i$ is the $i$-th row of $\hat{B}$.

High in-sample $R^2$ does not guarantee out-of-sample performance for several reasons:

Overfitting: OLS minimizes the in-sample residuals by construction. With $k$ factors and $T$ observations, the degrees of freedom consumed are $k+1$ per asset (the intercept plus $k$ loadings). If $k$ is large relative to $T$, the estimated loadings absorb sample noise rather than true signal. In the extreme case $k \geq T-1$ (with linearly independent regressors), the $k+1$ coefficients can interpolate all $T$ observations, so you get $R^2 = 1$ trivially -- the fit is perfect but meaningless.

Spurious correlations: With many candidate factors, some will correlate with a particular asset's returns by chance. OLS has no mechanism to distinguish real from spurious loadings.

Instability of $\hat{B}$: When $F^\top F$ is near-singular (collinear factors), the OLS estimates have huge variance. Small perturbations in the data produce wildly different $\hat{B}$, so the in-sample fit does not generalize.

The adjusted $R^2$ penalizes for the number of factors but does not fully solve the problem. True out-of-sample testing (time-series cross-validation) is needed.
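As a rough illustration of the overfitting point, here is a small simulation sketch for a single asset. Only the first factor carries signal; the remaining columns are pure noise. The sample sizes, the 0.5 loading, the choice of the training mean as the out-of-sample benchmark, and the helper names `r_squared` and `fit_and_score` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
T_train, T_test = 60, 60            # hypothetical sample sizes
K_MAX = T_train - 1                 # with an intercept, k = T - 1 interpolates

# One asset, one real factor (column 0); every other column is pure noise.
f_all = rng.normal(size=(T_train + T_test, K_MAX))
y = 0.5 * f_all[:, 0] + rng.normal(size=T_train + T_test)

def r_squared(y_obs, y_hat, y_bar):
    """1 - RSS/TSS, with TSS measured around the supplied mean y_bar."""
    return 1.0 - np.sum((y_obs - y_hat) ** 2) / np.sum((y_obs - y_bar) ** 2)

def fit_and_score(k):
    """Fit OLS on the first k factors; return (in-sample, out-of-sample) R^2."""
    X = np.column_stack([np.ones(len(y)), f_all[:, :k]])
    X_tr, X_te = X[:T_train], X[T_train:]
    y_tr, y_te = y[:T_train], y[T_train:]
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    r2_in = r_squared(y_tr, X_tr @ coef, y_tr.mean())
    # Out of sample, the benchmark is the training-period mean.
    r2_out = r_squared(y_te, X_te @ coef, y_tr.mean())
    return r2_in, r2_out

results = {k: fit_and_score(k) for k in (1, 10, 30, K_MAX)}
for k, (r2_in, r2_out) in results.items():
    print(f"k={k:2d}  in-sample R2={r2_in:7.4f}  out-of-sample R2={r2_out:10.3f}")
```

Because the models are nested, the in-sample $R^2$ can only rise as noise factors are added, and it reaches 1 at $k = T - 1$. The out-of-sample $R^2$ has no such guarantee and can fall below zero.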

Part (c): Ridge Estimator and Bias-Variance Tradeoff

Ridge regression adds an $\ell_2$ penalty on the loadings. The ridge estimator (for demeaned data) is:

$\hat{B}_{\text{ridge}}^\top = (F^\top F + \lambda I_k)^{-1} F^\top R$

where $\lambda > 0$ is the regularization parameter and $I_k$ is the $k \times k$ identity matrix. The transpose appears because $B$ is $N \times k$ while the right-hand side is $k \times N$.
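One way to see where this formula comes from is to write the penalized least-squares problem asset by asset. The notation $\beta_i \in \mathbb{R}^k$ (the loading vector of asset $i$) and $R_{\cdot i}$ (column $i$ of $R$) is introduced here only for this derivation:

$\hat{\beta}_i = \arg\min_{\beta \in \mathbb{R}^k} \; \|R_{\cdot i} - F\beta\|_2^2 + \lambda \|\beta\|_2^2$

Setting the gradient with respect to $\beta$ to zero gives the normal equations:

$-2 F^\top (R_{\cdot i} - F\beta) + 2\lambda\beta = 0 \quad\Longrightarrow\quad (F^\top F + \lambda I_k)\,\hat{\beta}_i = F^\top R_{\cdot i}$

Because $F^\top F$ is positive semidefinite and $\lambda > 0$, the matrix $F^\top F + \lambda I_k$ is positive definite and therefore invertible even when $F^\top F$ itself is singular. Stacking the $\hat{\beta}_i$ as columns yields the $k \times N$ matrix $(F^\top F + \lambda I_k)^{-1} F^\top R$, whose $i$-th column holds the ridge loadings of asset $i$.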

How $\lambda$ affects bias and variance:

Let $F^\top F = V \Sigma^2 V^\top$ be the eigendecomposition of the Gram matrix, with eigenvalues $\sigma_1^2 \geq \cdots \geq \sigma_k^2$. The variance of the OLS estimate along the $j$-th eigenvector scales like $1/\sigma_j^2$, so OLS amplifies noise in directions with small eigenvalues (nearly collinear factor combinations). Relative to OLS, ridge multiplies the component of the estimate along the $j$-th eigenvector by $\sigma_j^2 / (\sigma_j^2 + \lambda)$, so:

Variance decreases: Directions with small $\sigma_j^2$ (unstable directions) get heavily shrunk. The overall estimation variance falls because the estimator no longer chases noise in the ill-conditioned directions.

Bias increases: Ridge shrinks all loadings toward zero, so $E[\hat{B}_{\text{ridge}}] \neq B$ whenever $B \neq 0$. The bias along the $j$-th eigenvector is proportional to $\lambda / (\sigma_j^2 + \lambda)$. Components along strong directions (large $\sigma_j^2$) are barely biased, while components along weak directions (small $\sigma_j^2$) are heavily biased.

Tradeoff: For small $\lambda$, ridge behaves like OLS (low bias, high variance). For large $\lambda$, all loadings shrink toward zero (high bias, low variance). The optimal $\lambda$ minimizes the out-of-sample mean squared error and is typically chosen by cross-validation. In practice, moderate shrinkage often improves out-of-sample factor model performance, particularly when factors are collinear or $T$ is short, because the variance reduction outweighs the added bias.
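The eigenvalue-by-eigenvalue shrinkage described above can be verified directly. The sketch below simulates two nearly collinear factors, computes OLS and ridge loadings, and checks that the ridge coordinates in the eigenbasis of $F^\top F$ equal the OLS coordinates times $\sigma_j^2 / (\sigma_j^2 + \lambda)$. The data, the sizes, the value `lam = 5.0`, and the helper name `ridge_loadings` are illustrative assumptions, not tuned or recommended settings.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, k = 120, 5, 4                 # hypothetical sizes

# Build nearly collinear factors: column 3 is almost a copy of column 2.
factors = rng.normal(size=(T, k))
factors[:, 3] = factors[:, 2] + 0.01 * rng.normal(size=T)
B_true = rng.normal(size=(N, k))
returns = factors @ B_true.T + rng.normal(size=(T, N))

F = factors - factors.mean(axis=0)  # demeaned factor returns, T x k
R = returns - returns.mean(axis=0)  # demeaned asset returns, T x N

def ridge_loadings(F, R, lam):
    """Return the N x k ridge loadings; lam = 0 reproduces OLS."""
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ R).T

lam = 5.0                           # illustrative value, not a tuned choice
B_ols = ridge_loadings(F, R, 0.0)
B_ridge = ridge_loadings(F, R, lam)

# Eigen-view: rotate into the eigenvectors of F^T F and compare coordinates.
eigvals, V = np.linalg.eigh(F.T @ F)      # eigenvalues in ascending order
shrink = eigvals / (eigvals + lam)        # sigma_j^2 / (sigma_j^2 + lambda)
ols_coords = B_ols @ V                    # N x k, column j = eigen-direction j
ridge_coords = B_ridge @ V

print(np.round(eigvals, 4))
print(np.round(shrink, 4))
print(np.allclose(ridge_coords, ols_coords * shrink))
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the first entry of `shrink` corresponds to the near-collinear direction and is the one damped most heavily.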

Answer:

OLS: $\hat{B}^\top = (F^\top F)^{-1} F^\top R$ (equivalently $\hat{B} = R^\top F (F^\top F)^{-1}$)
Ridge: $\hat{B}_{\text{ridge}}^\top = (F^\top F + \lambda I_k)^{-1} F^\top R$
In-sample $R^2$ overstates predictive ability because OLS overfits noise, especially with many factors relative to observations.
Ridge shrinks loadings toward zero, reducing estimation variance at the cost of bias. The shrinkage factor for eigenvector $j$ is $\sigma_j^2 / (\sigma_j^2 + \lambda)$, meaning weak/unstable directions are shrunk most aggressively -- exactly where OLS is most unreliable.

Intuition

This problem illustrates the fundamental tension in empirical factor modeling: you want enough factors to capture the cross-section of returns, but each additional factor costs you estimation precision. OLS is the maximum likelihood estimator under Gaussian errors and, by the Gauss-Markov theorem, the minimum-variance linear unbiased estimator. Unbiasedness is expensive, though: within the ridge family, OLS is the $\lambda = 0$ endpoint, which has the largest variance. When the Gram matrix $F^\top F$ has small eigenvalues -- which happens when factors are highly correlated or the time series is short -- OLS blows up in those directions, fitting noise rather than signal.

Ridge regression is the simplest member of a family of shrinkage estimators (which also includes LASSO, elastic net, and Bayesian shrinkage priors) that trade a small, controlled bias for a large reduction in variance. In quant finance, this is not just a statistical nicety -- it is the difference between a factor model that works in backtest and one that works in production. The eigenvalue-by-eigenvalue view of ridge shrinkage ($\sigma_j^2 / (\sigma_j^2 + \lambda)$) is the key mental model: ridge automatically identifies the unstable directions in your factor space and damps them, leaving the well-estimated loadings nearly untouched.
