L1 vs. L2 Regularization

Regression · Medium

You are building a linear regression model with many features.

  1. How do $L_1$ (Lasso) and $L_2$ (Ridge) regularization differ in their effect on the estimated coefficients?
  2. Why does $L_1$ produce exactly zero coefficients while $L_2$ does not? Give both a geometric and a Bayesian explanation.
  3. Why might we prefer models with small coefficient values, even if larger coefficients fit the training data better?

Hints

  1. Think about the shape of the constraint regions: the $L_1$ ball is a diamond with corners, while the $L_2$ ball is a smooth sphere. Where are the contour lines of the quadratic loss most likely to touch each shape?
  2. In the Bayesian interpretation, $L_2$ corresponds to a Gaussian prior (log-density smooth and flat at zero) and $L_1$ corresponds to a Laplace prior (log-density with a sharp kink at zero). The MAP estimate under each prior gives Ridge and Lasso respectively.
  3. For the preference for small coefficients, think about the bias-variance tradeoff: large coefficients fit noise, and a small increase in bias from shrinkage can produce a large decrease in variance.

Worked Solution

How to Think About It: Both $L_1$ and $L_2$ regularization add a penalty to the loss function to prevent overfitting, but they do it in fundamentally different ways. Ridge shrinks everything proportionally -- big coefficients get shrunk a lot, small ones get shrunk a little, but nothing goes to exactly zero. Lasso, on the other hand, is a feature selector: it drives some coefficients all the way to zero, effectively removing those features from the model. The reason comes down to geometry -- the $L_1$ ball has corners, and the corners are where coordinates are zero.

Key Insight: The shape of the constraint region determines the solution. The $L_1$ ball is a diamond with corners on the axes; the $L_2$ ball is a smooth sphere. The loss function's contours are more likely to first touch a corner (sparse solution) than a smooth surface (dense solution).

The Method:

(1) Effect on coefficients:

$L_2$ (Ridge): Minimizes $\|y - X\beta\|^2 + \lambda \sum_j \beta_j^2$. The closed-form solution is:

$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$

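In the orthonormal design case ($X^\top X = I$) this shrinks every coefficient toward zero by the same factor, $1/(1 + \lambda)$. If the OLS solution has $\hat{\beta}_j^{\text{OLS}} = 5$, ridge might shrink it to 3, but never to exactly 0. For a general design, the shrinkage acts along the principal directions of $X$: the component of the solution along the $j$-th right singular vector is multiplied by $d_j^2 / (d_j^2 + \lambda)$, where $d_j$ is the $j$-th singular value of $X$, so low-variance directions are shrunk the most.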
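A minimal NumPy sketch of the ridge closed form, checked against the singular-value view in which each principal direction of $X$ is shrunk by $d_j^2 / (d_j^2 + \lambda)$. The simulated data, the seed, and the value of `lam` are illustrative assumptions, not part of the problem.

```python
import numpy as np

# Illustrative data (assumed): 200 samples, 5 features, Gaussian noise.
rng = np.random.default_rng(0)
n, p, lam = 200, 5, 10.0
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=n)

# OLS and ridge via the normal equations.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same ridge solution via the SVD: rotate the OLS solution into the
# principal directions, shrink each component, rotate back.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
beta_ridge_svd = Vt.T @ (shrink * (Vt @ beta_ols))

print("OLS:  ", np.round(beta_ols, 3))
print("Ridge:", np.round(beta_ridge, 3))
print("Shrinkage factors:", np.round(shrink, 3))
```

Every shrinkage factor lies strictly between 0 and 1 for $\lambda > 0$, which is why ridge pulls the solution toward zero without generically landing on it.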

$L_1$ (Lasso): Minimizes $\|y - X\beta\|^2 + \lambda \sum_j |\beta_j|$. There is no closed form in general, but in the orthonormal case ($X^\top X = I$):

$\hat{\beta}_{\text{lasso},j} = \text{sign}(\hat{\beta}_j^{\text{OLS}}) \max(|\hat{\beta}_j^{\text{OLS}}| - \lambda/2, 0)$

This is "soft thresholding" -- OLS coefficients whose absolute value is at most $\lambda/2$ are set exactly to zero, and all others are pulled toward zero by $\lambda/2$.
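A small sketch of soft thresholding under an orthonormal design, with a check of the Lasso optimality (subgradient) conditions for the objective $\|y - X\beta\|^2 + \lambda \sum_j |\beta_j|$. The helper name `soft_threshold`, the simulated data, and `lam` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Shrink each entry of z toward zero by thresh; clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

# Illustrative data (assumed). QR gives X with orthonormal columns.
rng = np.random.default_rng(1)
n, p, lam = 100, 4, 2.0
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([3.0, 0.4, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# With X^T X = I the OLS solution is X^T y, and the Lasso solution
# is the soft-thresholded OLS solution with threshold lam / 2.
beta_ols = X.T @ y
beta_lasso = soft_threshold(beta_ols, lam / 2)

# Optimality check: 2 * X_j^T (y - X beta) equals lam * sign(beta_j)
# on nonzero coordinates and has magnitude at most lam on zero ones.
grad = 2 * X.T @ (y - X @ beta_lasso)
active = beta_lasso != 0

print("OLS:  ", np.round(beta_ols, 3))
print("Lasso:", np.round(beta_lasso, 3))
```

The two coefficients whose OLS estimates fall below the threshold $\lambda/2 = 1$ come out as exact zeros, not merely small numbers.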

(2) Why does $L_1$ produce zeros?

Geometric explanation: Think of the constrained optimization: minimize the loss subject to $\sum_j |\beta_j| \le t$. The feasible region is a diamond (in 2D) or cross-polytope (in higher dimensions). The loss contours are ellipses centered at the OLS solution. As the contours expand outward from the OLS solution, the first point of contact with the diamond is often at a corner or, in higher dimensions, on an edge or low-dimensional face -- and at those points one or more coordinates are exactly zero. For the $L_2$ ball (a circle in 2D), the boundary is smooth, so the first contact is almost never at a point where any coordinate is exactly zero.
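The corner argument can be checked numerically in the simplest setting. The sketch below assumes circular loss contours, i.e. loss $\|\beta - \hat{\beta}^{\text{OLS}}\|^2$, so the constrained solution is the Euclidean projection of the OLS point onto the ball. The random OLS points, the radius `t`, and the helper names are illustrative assumptions; the $L_1$ projection uses the standard sort-based algorithm.

```python
import numpy as np

def project_l1_ball(v, t):
    """Euclidean projection of v onto {b : sum |b_j| <= t}."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1)    # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_l2_ball(v, t):
    """Euclidean projection of v onto {b : ||b||_2 <= t}."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= t else v * (t / norm)

# Hypothetical OLS solutions scattered around the origin in 2D.
rng = np.random.default_rng(3)
t = 1.0
ols_points = rng.normal(scale=2.0, size=(10_000, 2))

l1_solutions = np.array([project_l1_ball(v, t) for v in ols_points])
l2_solutions = np.array([project_l2_ball(v, t) for v in ols_points])

# Fraction of constrained solutions with at least one exact zero.
frac_sparse_l1 = np.mean(np.any(l1_solutions == 0.0, axis=1))
frac_sparse_l2 = np.mean(np.any(l2_solutions == 0.0, axis=1))
print("L1 ball:", frac_sparse_l1, " L2 ball:", frac_sparse_l2)
```

Under these assumptions a large share of the $L_1$ solutions land on a corner of the diamond, while the $L_2$ solutions only rescale the OLS point and therefore keep both coordinates nonzero.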

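Bayesian explanation: Assume Gaussian noise with variance $\sigma^2$. $L_2$ then corresponds to a Gaussian prior on each $\beta_j$: $\beta_j \sim N(0, \sigma^2/\lambda)$, which is $N(0, 1/\lambda)$ when $\sigma^2 = 1$. $L_1$ corresponds to a Laplace (double-exponential) prior: $p(\beta_j) \propto e^{-\lambda|\beta_j| / (2\sigma^2)}$. Both priors are continuous, so neither puts positive probability on $\beta_j$ being exactly zero; the difference shows up in the MAP estimate. The Gaussian log-density is smooth with zero slope at the origin, so its pull toward zero fades as a coefficient gets small, and the posterior mode generically stays nonzero. The Laplace log-density has a kink at zero and pulls toward zero with constant strength, so the posterior mode sits exactly at zero whenever the data's pull is weaker than the penalty. The MAP estimate under a Laplace prior is the Lasso solution.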
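The link between prior and penalty can be made explicit with a short derivation sketch. It assumes the likelihood $y \mid \beta \sim N(X\beta, \sigma^2 I)$; the symbols $\sigma^2$ (noise variance), $\tau^2$ (Gaussian prior variance), and $b$ (Laplace scale) are introduced here for illustration. The MAP estimate minimizes the negative log-posterior:

$\hat{\beta}_{\text{MAP}} = \arg\min_\beta \left[ \frac{1}{2\sigma^2} \|y - X\beta\|^2 - \sum_j \log p(\beta_j) \right]$

With a Gaussian prior $p(\beta_j) \propto e^{-\beta_j^2 / (2\tau^2)}$, multiplying through by $2\sigma^2$ gives

$\|y - X\beta\|^2 + \frac{\sigma^2}{\tau^2} \sum_j \beta_j^2$, which is the ridge objective with $\lambda = \sigma^2 / \tau^2$.

With a Laplace prior $p(\beta_j) \propto e^{-|\beta_j| / b}$, the same step gives

$\|y - X\beta\|^2 + \frac{2\sigma^2}{b} \sum_j |\beta_j|$, which is the lasso objective with $\lambda = 2\sigma^2 / b$.

In both cases a tighter prior (smaller $\tau^2$ or $b$) means a larger $\lambda$ and stronger shrinkage.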

(3) Why prefer small coefficients?

  • Bias-variance tradeoff: Very large coefficients are often a sign that the model is fitting noise in the training data. Shrinking coefficients introduces some bias but can reduce variance by much more, improving out-of-sample performance.
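  • Condition number: When features are strongly correlated, $X^\top X$ is ill-conditioned and OLS coefficients can be huge and opposite-signed (they cancel in-sample but blow up out-of-sample). Regularization stabilizes these: adding $\lambda I$ keeps the smallest eigenvalue away from zero.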
  • Interpretability: Small, stable coefficients are easier to understand, explain, and trust.
  • Robustness: A model with large coefficients is fragile -- small changes in input features cause large changes in predictions.
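The conditioning point can be illustrated with two nearly collinear features. This is a sketch on simulated data; the correlation level, noise scale, seed, and `lam` are assumptions chosen for illustration.

```python
import numpy as np

# Two almost identical features (assumed setup): x2 = x1 + tiny noise.
rng = np.random.default_rng(2)
n, lam = 50, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.5 * rng.normal(size=n)   # only the shared signal matters

gram = X.T @ X
beta_ols = np.linalg.solve(gram, X.T @ y)
beta_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)

# Adding lam * I lifts the smallest eigenvalue, which lowers the
# condition number of the matrix being inverted.
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
print("Condition numbers: ", round(cond_ols, 1), "vs", round(cond_ridge, 1))
```

For any $\lambda > 0$ the ridge solution has a smaller Euclidean norm than the OLS solution, and the regularized matrix is better conditioned; how erratic the OLS coefficients look on a given draw depends on the noise.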

In finance specifically, large coefficients often indicate overfitting to in-sample quirks. A signal with an outsized coefficient (say 50, when comparable, similarly scaled signals sit near 1) that works on training data is unlikely to persist out of sample.

Answer: Ridge shrinks coefficients proportionally but never to zero; Lasso performs feature selection by setting some coefficients exactly to zero. The geometric reason is that the $L_1$ constraint set has corners at the axes. Small coefficients are preferred because they reduce overfitting, improve stability, and enhance out-of-sample generalization.

Intuition

The $L_1$ vs. $L_2$ distinction is one of the most frequently asked questions in quant interviews because it tests whether you understand regularization beyond just "it prevents overfitting." The deep insight is geometric: the corners of the $L_1$ ball make sparsity natural, while the smoothness of the $L_2$ ball makes sparsity almost impossible. This is not just a mathematical curiosity -- it determines whether your model does feature selection automatically or not.

In practice, the choice between Lasso and Ridge often comes down to whether you believe the true model is sparse (only a few features matter) or dense (many features contribute a little). In finance, the answer is usually "we do not know," which is why Elastic Net (a mix of $L_1$ and $L_2$) is popular. But the interview wants you to understand the mechanics: Ridge for stability when features are correlated, Lasso for selection when you suspect most features are irrelevant.
