Problems With L0 Regularization
In regression, we sometimes want to penalize the number of non-zero coefficients directly. The $L_0$ "norm" of a coefficient vector $\beta$ counts how many entries are non-zero:
$\|\beta\|_0 = \#\{j : \beta_j \neq 0\}$
The $L_0$-regularized regression problem is:
$\min_{\beta} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_0$
1. Why is this problem fundamentally harder to solve than $L_1$ or $L_2$ regularized regression?
2. What specific mathematical properties make $L_0$ regularization intractable for standard optimization algorithms?
3. Why is $L_1$ (Lasso) used as a practical substitute, and in what sense is it the "best" convex substitute for $L_0$?
Open the full interactive solver, hints, and worked solution →