Loss Function Minimizers and Regression Variants

Regression · Medium · Free problem

Answer the following questions about optimization and regression.

  1. What value of $A$ minimizes $\sum_{i=1}^{n} (x_i - A)^2$? Prove it.
  2. What value of $A$ minimizes $\sum_{i=1}^{n} |x_i - A|$? Why?
  3. Consider the linear model $Y = X\beta$ with squared-error loss $L(\beta) = \|Y - X\beta\|^2$.

(a) Derive the OLS estimator $\hat{\beta}$.

(b) What goes wrong if $X^T X$ is singular or ill-conditioned?

(c) What is the ridge regression estimator, and how does it help?

(d) How is the lasso estimator computed, and why does it not have a closed-form solution?

Hints

  1. For parts 1 and 2, think about which summary statistic minimizes each loss function. Squared loss and absolute loss have fundamentally different minimizers.
  2. For OLS, take the matrix gradient of $(Y - X\beta)^T(Y - X\beta)$ and set it to zero. Ridge regression just adds $\lambda I$ to $X^T X$.
  3. Lasso has no closed form because $|\beta_j|$ is not differentiable at zero. The solution involves a soft-thresholding operator applied coordinate-by-coordinate.

Worked Solution

How to Think About It: These questions form a natural progression from the simplest optimization problem (minimize squared loss over a scalar) to the full machinery of regularized regression. The common thread is that the loss function you choose determines which summary statistic you recover. Squared loss gives you the mean, absolute loss gives you the median, and adding penalty terms to squared loss gives you shrunken versions of OLS that trade bias for stability.

Formal Solution:

Part 1: Minimizer of $L_2$ loss -- the mean.

Take the derivative and set it to zero:

$\frac{d}{dA} \sum_{i=1}^{n} (x_i - A)^2 = -2 \sum_{i=1}^{n} (x_i - A) = 0$

$\sum_{i=1}^{n} x_i = nA \implies A^{*} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

The second derivative is $2n > 0$, confirming this is a minimum. The minimizer of squared loss is the sample mean.
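As a quick numerical sanity check of Part 1, the sketch below evaluates the squared loss at the sample mean and at a few nearby candidates. The sample `xs` and the helper `squared_loss` are hypothetical, chosen only for illustration; this is a check, not a proof.

```python
import statistics

def squared_loss(xs, a):
    """Sum of squared deviations of xs from the candidate value a."""
    return sum((x - a) ** 2 for x in xs)

# Hypothetical sample, chosen only for illustration.
xs = [2.0, 3.0, 5.0, 10.0]
mean = statistics.fmean(xs)  # 5.0

# Compare the loss at the mean with the loss at nearby candidates.
loss_at_mean = squared_loss(xs, mean)  # 9 + 4 + 0 + 25 = 38
candidates = [mean + d for d in (-1.0, -0.1, 0.1, 1.0)]
losses_nearby = [squared_loss(xs, a) for a in candidates]

print(loss_at_mean, losses_nearby)
```

Every nearby loss exceeds the loss at the mean by $n(A - \bar{x})^2$, which follows from expanding the square around $\bar{x}$.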

Part 2: Minimizer of $L_1$ loss -- the median.

The function $g(A) = \sum_{i=1}^{n} |x_i - A|$ is piecewise linear and convex. Its derivative (where it exists) is:

$g'(A) = -\#\{i : x_i > A\} + \#\{i : x_i < A\}$

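This equals zero when the number of data points above $A$ equals the number below -- which is the defining property of the median. Formally, the subdifferential of $|x_i - A|$ with respect to $A$ is $\{-\text{sign}(x_i - A)\}$ when $A \neq x_i$ and the interval $[-1, 1]$ when $A = x_i$, and the optimality condition $0 \in \partial g(A)$ holds at any median. For odd $n$ the minimizer is the middle order statistic; for even $n$ every point between the two middle order statistics is a minimizer.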

Intuition: moving $A$ toward the side with more data points always decreases the total absolute deviation. The median is the balance point.
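The balance-point argument can be checked with a brute-force grid search. The sketch below uses a hypothetical sample with one large outlier and an odd number of points, so the median is unique; `absolute_loss` and the grid are illustrative assumptions.

```python
import statistics

def absolute_loss(xs, a):
    """Sum of absolute deviations of xs from the candidate value a."""
    return sum(abs(x - a) for x in xs)

# Hypothetical sample with one outlier; odd n, so the median is unique.
xs = [1.0, 2.0, 4.0, 7.0, 100.0]
median = statistics.median(xs)  # 4.0
mean = statistics.fmean(xs)     # 22.8, dragged upward by the outlier

# Brute-force search over a grid of candidates 0.0, 0.1, ..., 100.0.
grid = [i / 10 for i in range(0, 1001)]
best = min(grid, key=lambda a: absolute_loss(xs, a))

print(best, absolute_loss(xs, median), absolute_loss(xs, mean))
```

The grid minimizer coincides with the median, and the mean has a strictly larger absolute loss because the outlier pulls it away from the bulk of the data.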

Part 3(a): OLS estimator.

Expand the loss:

$L(\beta) = (Y - X\beta)^T(Y - X\beta)$

Take the gradient and set it to zero:

$\nabla_\beta L = -2X^T(Y - X\beta) = 0 \implies X^T X \beta = X^T Y$

If $X^T X$ is invertible:

$\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T Y$
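A minimal NumPy sketch of the normal equations, assuming synthetic data (the dimensions, `beta_true`, and noise level are made up for illustration). It solves $X^T X \beta = X^T Y$ directly instead of forming the inverse, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data; sizes, coefficients, and noise level are illustrative.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T Y without an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's SVD-based least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient -2 X^T (Y - X beta) should vanish at the solution.
grad = -2 * X.T @ (Y - X @ beta_normal)

print(beta_normal, beta_lstsq, np.abs(grad).max())
```

Solving the linear system is generally preferred to computing $(X^T X)^{-1}$ explicitly, since it is cheaper and numerically better behaved.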

Part 3(b): When $X^T X$ is singular or ill-conditioned.

If $X^T X$ is singular (rank-deficient), the system $X^T X \beta = X^T Y$ has infinitely many solutions -- the OLS estimator is not unique. This happens when $p > n$ (more features than observations) or when features are perfectly collinear.

If $X^T X$ is nearly singular (ill-conditioned, with condition number $\kappa \gg 1$), the estimator is technically unique but wildly unstable: tiny perturbations in $Y$ cause huge swings in $\hat{\beta}$. The coefficients become unreliable and often have inflated variance.
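The instability can be seen on a tiny deterministic example. In the sketch below, the design, the gap `eps`, and the perturbation `delta` are all hypothetical and constructed so the answer can be verified by hand: the second feature is almost a copy of the first, and a perturbation of $Y$ with norm 0.02 moves the coefficients from about $(1, 1)$ to about $(-99, 101)$.

```python
import numpy as np

eps = 1e-4  # hypothetical gap between the two nearly identical features
pattern = np.array([1.0, -1.0, 1.0, -1.0])
x1 = np.ones(4)
x2 = x1 + eps * pattern            # almost a copy of x1
X = np.column_stack([x1, x2])
Y = x1 + x2                        # exact fit with beta = (1, 1)

def ols(X, Y):
    """Least-squares coefficients via NumPy's SVD-based solver."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

cond = np.linalg.cond(X.T @ X)     # roughly 4e8 for this design

# Small perturbation of Y along the direction x2 - x1.
delta = 0.01 * pattern
beta_a = ols(X, Y)                 # close to (1, 1)
beta_b = ols(X, Y + delta)         # close to (-99, 101)

coef_shift = np.linalg.norm(beta_b - beta_a)
print(cond, beta_a, beta_b, coef_shift)
```

Because `delta` equals $(0.01/\text{eps})(x_2 - x_1)$, the perturbed target is exactly $-99 x_1 + 101 x_2$, so the coefficient swing scales like $1/\text{eps}$ even though the fitted values barely change.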

Part 3(c): Ridge regression.

Ridge regression adds an $L_2$ penalty to the loss:

$L_{\text{ridge}}(\beta) = \|Y - X\beta\|^2 + \lambda \|\beta\|^2$

Taking the gradient and setting it to zero:

$-2X^T(Y - X\beta) + 2\lambda\beta = 0 \implies (X^T X + \lambda I)\beta = X^T Y$

$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y$

The matrix $X^T X + \lambda I$ is always invertible for $\lambda > 0$ (it shifts all eigenvalues up by $\lambda$), so the ill-conditioning problem is resolved. The price is bias: ridge shrinks coefficients toward zero, trading increased bias for reduced variance.
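To illustrate the eigenvalue shift, the sketch below builds a design with an exactly duplicated column, so $X^T X$ is singular and OLS is not unique, and then solves the ridge system. The data, the helper `ridge`, and the choice `lam = 1.0` are assumptions for illustration.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge coefficients from (X^T X + lam * I) beta = X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Synthetic data with an exactly duplicated column (illustrative only).
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])      # X^T X has rank 1
Y = 2.0 * x1 + 0.1 * rng.normal(size=n)

lam = 1.0
gram = X.T @ X
min_eig_ols = np.linalg.eigvalsh(gram).min()                      # about 0
min_eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(2)).min()  # about lam

beta_ridge = ridge(X, Y, lam)

# By symmetry both coefficients equal (x1 . Y) / (2 * x1 . x1 + lam).
expected = (x1 @ Y) / (2.0 * (x1 @ x1) + lam)
print(min_eig_ols, min_eig_ridge, beta_ridge, expected)
```

Ridge resolves the non-uniqueness by splitting the weight evenly across the duplicated columns, and the extra `lam` in the denominator is the shrinkage toward zero.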

Part 3(d): Lasso regression.

Lasso uses an $L_1$ penalty:

$L_{\text{lasso}}(\beta) = \|Y - X\beta\|^2 + \lambda \|\beta\|_1$

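The $L_1$ penalty is not differentiable at $\beta_j = 0$, so setting the gradient to zero does not yield a closed-form solution in general. The subgradient optimality condition for each coordinate is:

$0 \in -2X_j^T(Y - X\beta) + \lambda \, \partial|\beta_j|$, where $X_j$ is the $j$-th column of $X$, and $\partial|\beta_j| = \{\text{sign}(\beta_j)\}$ if $\beta_j \neq 0$ and $[-1, 1]$ if $\beta_j = 0$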

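This leads to a soft-thresholding operation on each coordinate. The standard algorithms are coordinate descent, which cycles through the coordinates and applies the soft-thresholding update to each in turn, and LARS (least angle regression).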

The key property of lasso is sparsity: for large enough $\lambda$, many coefficients are set exactly to zero, performing automatic feature selection.
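A minimal sketch of cyclic coordinate descent for the objective $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$ as written above (no factor of $\tfrac{1}{2}$, so the threshold is $\lambda/2$). Minimizing over a single $\beta_j$ with the others held fixed gives $\beta_j = S(X_j^T r_j, \lambda/2) / X_j^T X_j$, where $r_j$ is the residual with feature $j$ removed and $S$ is soft-thresholding. The function names, the fixed iteration count, and the small design with orthogonal columns are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; returns exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=100):
    """Cyclic coordinate descent for ||Y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # X_j^T X_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j's current contribution.
            r_j = Y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Hypothetical design with mutually orthogonal columns (each has norm^2 = 4).
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
Y = 3.0 * X[:, 0] + 0.1 * X[:, 1] - 2.0 * X[:, 2]

lam = 2.0
beta = lasso_cd(X, Y, lam)  # approximately (2.75, 0, -1.75)

# Gradient of the squared-error part, used in the optimality condition.
grad = -2 * X.T @ (Y - X @ beta)
print(beta, grad)
```

The weak feature (true coefficient 0.1) is set exactly to zero because $|X_j^T Y| = 0.4$ falls below the threshold $\lambda/2 = 1$, while the two strong coefficients are each shrunk by $(\lambda/2)/4 = 0.25$. With correlated columns the same loop applies, but it typically needs several sweeps and a convergence check rather than a fixed iteration count.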

Answer:

  1. $A^{*} = \bar{x}$ (sample mean) -- minimizer of $L_2$ loss.
  2. $A^{*} = \text{median}(x_1, \ldots, x_n)$ -- minimizer of $L_1$ loss.
  3. Linear model with squared-error loss:

(a) $\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T Y$.

(b) Singular $X^T X$ gives non-unique solutions; ill-conditioned $X^T X$ gives unstable solutions.

(c) $\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y$ -- guaranteed invertible, trades bias for stability.

(d) $\hat{\beta}_{\text{lasso}}$ has no closed form due to the non-differentiable $L_1$ penalty; solved via coordinate descent or LARS, and produces sparse solutions.

Intuition

This sequence of problems illustrates a deep principle: the loss function you choose determines the estimator you get, and each choice embeds different assumptions about your data. Squared loss produces the mean, which is optimal under Gaussian noise but sensitive to outliers. Absolute loss produces the median, which is robust but harder to optimize. OLS is the multivariate extension of mean estimation, and its vulnerabilities (singularity, instability) mirror the problems of estimating a mean with too little data or too much collinearity.

Ridge and lasso are the two canonical regularization strategies, and their difference comes down to geometry: the $L_2$ ball is round (shrinks all coefficients smoothly), while the $L_1$ ball has corners (drives coefficients exactly to zero). This is why lasso performs feature selection and ridge does not. In practice, elastic net (a blend of both penalties) is a common choice when you want sparsity together with the stability of ridge.
