Regression with More Features Than Observations

Regression · Medium · Free problem

In many practical settings -- genomics, NLP, factor models with hundreds of signals -- you end up with more features than observations ($d > n$).

  1. Explain precisely why ordinary least squares (OLS) breaks down when $d > n$. What goes wrong with the normal equations $\hat{\beta} = (X^\top X)^{-1} X^\top y$?
  2. Even if you could find a solution, why would it be useless for prediction?
  3. Describe how each of the following fixes the problem, and explain the trade-offs:
     - Ridge regression (L2 penalty)
     - Lasso (L1 penalty)
     - PCA regression (dimensionality reduction)
     - Elastic Net

Hints

  1. Think about the rank of $X^\top X$ when $d > n$. What does that imply about the existence of $(X^\top X)^{-1}$?
  2. A system with more unknowns than equations is underdetermined -- it has infinitely many solutions. How does adding a penalty term $\lambda \|\beta\|$ change the geometry of the solution space?
  3. Ridge adds $\lambda I$ to restore full rank; Lasso uses the L1 penalty's geometry (diamond constraint set) to push coefficients to exactly zero. Think about why the corners of an L1 ball encourage sparsity while the smooth L2 ball does not.

Worked Solution

How to Think About It: The core issue is that when you have more knobs to turn ($d$ coefficients) than data points ($n$ equations), the system $X\beta = y$ is underdetermined. There are infinitely many $\beta$ vectors that fit the data perfectly -- and "perfectly" is the problem. You are fitting noise, not signal. In quant finance, this comes up constantly: you might screen 500 potential alpha signals across 200 trading days. Without regularization you will get a backtest that looks incredible and a live strategy that bleeds money.

Key Insight: The matrix $X^\top X$ is $d \times d$ but has rank at most $\min(n, d) = n < d$, so it is singular. No inverse exists, and OLS has no unique solution. The fix, in every case, is to constrain the solution space -- either by shrinking coefficients, forcing sparsity, or reducing the dimension of $X$ before fitting.
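
The rank claim can be written out as a short derivation. This is standard linear algebra rather than anything specific to this problem; $v$ and $t$ are symbols introduced here for illustration.

$$
\operatorname{rank}(X^\top X) = \operatorname{rank}(X) \le \min(n, d) = n < d
\quad\Longrightarrow\quad
\exists\, v \ne 0 \text{ with } Xv = 0 .
$$

$$
X^\top X v = 0
\qquad\text{and}\qquad
X(\beta + t v) = X\beta \quad \text{for every } t \in \mathbb{R} .
$$

So $X^\top X$ has a nontrivial null space (it is singular), and any $\beta$ that minimizes the squared error can be shifted along $v$ without changing the fit, which is where the infinitely many solutions come from.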

The Method:

*Why OLS breaks:*

  1. Singular normal equations. $X^\top X$ has rank at most $n$, so $(X^\top X)^{-1}$ does not exist. When $X$ has full row rank $n$, the loss surface has a $(d - n)$-dimensional flat valley of global minima, all with zero training error ($R^2 = 1$ on the training set).
  2. Perfect in-sample fit, little or no out-of-sample value. With $d > n$, you can generally interpolate the training data exactly, so the model memorizes noise along with any signal. Out-of-sample $R^2$ will typically be negative -- worse than just predicting the mean.
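
Both failure modes above can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the sizes, the seed, and the choice of a pure-noise target are assumptions made for illustration, not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                          # assumed sizes with d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)             # assumption: pure noise, no signal at all

# rank(X^T X) equals rank(X), which cannot exceed n.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution among the infinitely many minimizers.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
train_resid = np.linalg.norm(X @ beta_min_norm - y)

# Adding any null-space direction of X leaves the training fit unchanged.
_, _, Vt = np.linalg.svd(X)            # Vt is d x d; its last d - n rows span null(X)
null_dir = Vt[-1]
beta_other = beta_min_norm + 5.0 * null_dir
other_resid = np.linalg.norm(X @ beta_other - y)

# Fresh data from the same noise-only process: the "perfect" fit does not transfer.
X_new = rng.standard_normal((1000, d))
y_new = rng.standard_normal(1000)
test_mse = np.mean((X_new @ beta_min_norm - y_new) ** 2)
baseline_mse = np.mean(y_new ** 2)     # predicting the true mean, zero

print("rank of X:", rank, "with d =", d)
print("training residual norms:", train_resid, other_resid)
print("test MSE vs. mean-only baseline:", test_mse, baseline_mse)
```

Two different coefficient vectors fit the training set to machine precision even though the target is noise, and the fitted model does worse on new data than ignoring the features altogether.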

*How each method fixes it:*

1. Ridge regression (L2). Solve $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.$ Adding $\lambda I$ makes the matrix full-rank and invertible for any $\lambda > 0$. All coefficients shrink toward zero, but none are set exactly to zero. Ridge tends to work well when many features each contribute a small amount -- common in dense factor models. The trade-off: it never drops a feature, so the fitted model is not sparse.

2. Lasso (L1). Solve $\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$ The L1 penalty produces sparse solutions: many $\beta_j$ are driven to exactly zero. This performs automatic feature selection, which is useful when you believe only a handful of signals matter and the rest are noise.

3. PCA regression. Compute the top $k$ principal components of $X$ (where $k < n$), project the data onto this lower-dimensional subspace, and run OLS in the reduced space. This sidesteps the rank problem entirely and handles multicollinearity well. The drawbacks: the principal components are linear combinations of all original features, so interpretability suffers, and they are chosen to explain variance in $X$ without looking at $y$, so a low-variance direction that carries the signal can be discarded.

4. Elastic Net. Combine L1 and L2: $\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2.$ This gets you the sparsity of Lasso plus the stability of Ridge. It handles correlated features better than pure Lasso, which tends to arbitrarily pick one feature from a correlated group and drop the rest.
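
The four estimators above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a production solver: the data are synthetic, feature 1 is deliberately an exact copy of feature 0, and the names `soft_threshold` and `elastic_net_cd`, the penalty values, and `k = 5` are all hypothetical choices. Lasso and Elastic Net are fitted with plain cyclic coordinate descent on the objectives exactly as written above (Lasso is the `lam2 = 0` case).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100                           # assumed sizes with d > n
X = rng.standard_normal((n, d))          # columns are roughly unit scale by construction
X[:, 1] = X[:, 0]                        # assumption: feature 1 duplicates feature 0
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(n)   # synthetic target

def soft_threshold(v, t):
    """Shrink v toward zero by t; exactly zero when |v| <= t."""
    return np.sign(v) * max(abs(v) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Cyclic coordinate descent for ||y - Xb||^2 + lam1*||b||_1 + lam2*||b||_2^2."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()                     # kept equal to y - X @ beta
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Ridge: closed form; X^T X + lam * I is invertible for any lam > 0.
beta_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(d), X.T @ y)

# Lasso (lam2 = 0) and Elastic Net (lam2 > 0) share the same routine.
beta_lasso = elastic_net_cd(X, y, lam1=8.0, lam2=0.0)
beta_enet = elastic_net_cd(X, y, lam1=8.0, lam2=2.0)

# PCA regression: OLS on the scores of the top-k principal components.
k = 5
Xc, yc = X - X.mean(axis=0), y - y.mean()
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T                           # d x k loading matrix
Z = Xc @ V_k                             # n x k scores, with k < n
gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)
beta_pcr = V_k @ gamma                   # coefficients mapped back to the d features

for name, b in [("ridge", beta_ridge), ("lasso", beta_lasso),
                ("enet", beta_enet), ("pcr", beta_pcr)]:
    print(name, "nonzero:", int(np.count_nonzero(b)),
          "weights on features 0 and 1:", b[0], b[1])
```

With a duplicated column the Lasso objective has more than one minimizer, so which copy keeps the weight depends on the algorithm (here, the update order). The L2 term makes the Elastic Net objective strictly convex, so its solution is unique and splits the weight evenly across the two copies, as Ridge does.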

Practical Considerations:

  • Choosing $\lambda$: Use cross-validation, but be careful with time series data -- standard K-fold leaks future information. Walk-forward CV is the right choice in finance.
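  • Standardization matters. Ridge and Lasso penalties are scale-dependent: rescaling a feature rescales its coefficient, and with it how heavily that coefficient is penalized. Standardize features before fitting, or features measured on small scales (which need large coefficients) are penalized disproportionately.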
  • Lasso instability. When features are highly correlated, Lasso's variable selection is unstable -- small changes in the data can flip which features survive. Elastic Net or stability selection helps.
  • Bias-variance trade-off. All regularization methods introduce bias to reduce variance. The right amount of bias depends on how noisy your data is, and in finance, it is almost always very noisy.
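
The first two points can be combined in one loop: choose $\lambda$ by walk-forward validation and standardize using statistics from the training window only. The sketch below is a minimal illustration on synthetic data; `walk_forward_splits` and `ridge_fit` are hypothetical helper names, and the grid of $\lambda$ values, the fold layout, and the data-generating process are assumptions.

```python
import numpy as np

def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train_idx, test_idx) where training always precedes testing in time."""
    fold_size = (n_obs - min_train) // n_folds
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        yield np.arange(train_end), np.arange(train_end, train_end + fold_size)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
n_obs, d = 200, 500                       # e.g. 200 days, 500 candidate signals
scales = rng.uniform(0.1, 10.0, size=d)   # assumption: features on very different scales
X = rng.standard_normal((n_obs, d)) * scales
y = (X[:, :5] / scales[:5]).sum(axis=1) + rng.standard_normal(n_obs)  # synthetic target

lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]
cv_mse = {}
for lam in lambdas:
    errs = []
    for tr, te in walk_forward_splits(n_obs, n_folds=5, min_train=100):
        # Standardize with training-window statistics only, so nothing leaks backward.
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
        X_tr, X_te = (X[tr] - mu) / sd, (X[te] - mu) / sd
        y_mu = y[tr].mean()
        beta = ridge_fit(X_tr, y[tr] - y_mu, lam)
        errs.append(np.mean((X_te @ beta + y_mu - y[te]) ** 2))
    cv_mse[lam] = float(np.mean(errs))

best_lam = min(cv_mse, key=cv_mse.get)
print("validation MSE by lambda:", cv_mse)
print("selected lambda:", best_lam)
```

Each fold trains on an expanding window and tests on the block that immediately follows it, so no test observation ever precedes a training observation.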

Answer: OLS fails when $d > n$ because $X^\top X$ is singular (rank $\leq n < d$), yielding infinitely many solutions that overfit perfectly. Ridge regression restores invertibility by adding $\lambda I$; Lasso enforces sparsity via L1 penalty for automatic feature selection; PCA regression reduces dimensionality below $n$; and Elastic Net combines L1 and L2 to get sparsity with stability. In practice, choose based on whether you believe the true signal is dense (Ridge), sparse (Lasso), or sparse with correlated groups (Elastic Net).

Intuition

The fundamental issue is degrees of freedom: when you have more parameters than data points, the model can memorize the training set perfectly, which tells you nothing about future data. This is not an abstract concern -- it is the default situation in modern quant research, where you might test hundreds of signals on a few years of daily returns. Every regularization method works by trading some in-sample fit for out-of-sample generalizability, either by shrinking coefficients (Ridge), eliminating them (Lasso), or compressing the feature space (PCA).

The deeper lesson is that the choice of regularizer encodes your prior belief about the structure of the true signal. Ridge says "all features contribute a little." Lasso says "only a few features matter." Elastic Net says "a few groups of correlated features matter." PCA says "the signal lives in a low-dimensional subspace of the feature space." Getting this choice right matters more than any amount of hyperparameter tuning -- if your structural assumption is wrong, no amount of cross-validation will save you.
